Image and Video Generation Model Release Timeline - Major Text-to-Image and Text-to-Video Models across Vendors

First Published:
Last Updated:

This article is a cross-vendor installment in my history-and-timeline series. Where my earlier model timelines each tracked a single provider - Anthropic's Claude, OpenAI's GPT, and Amazon's Nova - this one steps across vendors to build a single release timeline for the image generation and video generation models that have shaped generative media since 2021, from OpenAI's original DALL·E and the open-weight release of Stable Diffusion through today's text-to-video models such as Sora, Veo, Runway, and Kling.

In this article I map the field into a landscape of vendors and access models, then build two chronological timelines - one for image generation models and one for video generation models - each row linked to the official announcement or the vendor's official page. I follow those with the evolution of capabilities and access (open weights, commercial APIs, native in-model image generation, text-to-video, and synchronized audio) and a neutral, vendor-by-vendor summary of where each major provider stands.

The scope of this article is limited to publicly announced models documented on each vendor's official pages. Some notable systems - for example Google's Imagen and Parti, or Meta's Make-A-Video and Movie Gen - were published as research and never released as products; these are included but explicitly labeled as research announcements. Where a model's announcement date differs from its general availability, the announcement date leads the row and the availability is noted. This timeline covers image and video generation only; audio, music, and speech generation, and 3D generation, are out of scope and deliberately deferred to a possible future article. Pricing changes frequently and is omitted, and I do not reproduce benchmark scores or compare output quality - the goal is a neutral, dated record of what was announced and when.

This timeline primarily references the following official sources.

Overview - From the Diffusion Era and DALL·E to Today's Image and Video Models

Generative image models moved from research curiosity to everyday tool in a remarkably short window. OpenAI's original DALL·E, shown in January 2021, was one of the first systems to turn a sentence into an image, and DALL·E 2 (April 2022) made photorealistic text-to-image generation widely visible. In parallel, Google published Imagen and Parti (2022) as research models. The decisive shift for the broader ecosystem, however, came in August 2022, when Stable Diffusion was released with open weights: for the first time a capable text-to-image model could run on consumer hardware and be freely fine-tuned, which seeded a large open ecosystem and a wave of commercial products.

From there the image field advanced along two tracks. On one track, vendors packaged image generation into commercial creative tools and clouds - Adobe Firefly (2023), Amazon Titan Image Generator and Google Imagen 2 on their respective clouds (late 2023), and later Amazon Nova Canvas (2024). On the other track, image generation moved inside large multimodal models: OpenAI's GPT-4o native image generation (March 2025) and Google's Gemini 2.5 Flash Image (August 2025) generate images within a language model rather than through a separate diffusion pipeline. Alongside these, open-weight lines kept advancing - Stability AI's SDXL and Stable Diffusion 3.5, Black Forest Labs' FLUX.1 and FLUX.2, and Alibaba's open-weight Qwen-Image - while Midjourney matured as a subscription product.

Video generation followed a similar arc a year or two behind. It began with research systems - Meta's Make-A-Video and Google's Imagen Video (2022) - neither of which was released as a product. Commercial video generation then arrived through Runway (Gen-1 and Gen-2, 2023) and Pika (2023). Stable Video Diffusion (November 2023) brought open weights to video. OpenAI's Sora, announced as a research preview in February 2024, launched publicly in December 2024, and a broad cohort followed: Google's Veo line, Luma Dream Machine and Ray, Kuaishou's Kling, MiniMax's Hailuo, Amazon's Nova Reel, and open-weight models from Tencent (HunyuanVideo) and Alibaba (Wan). During 2025 the frontier moved to synchronized native audio in the video itself, with Google's Veo 3 (May) and OpenAI's Sora 2 (September). The sections below lay this out as two dated timelines, preceded by a landscape map of who builds what.

The Landscape of Image and Video Generation Models

Before the chronology, it helps to see the field by vendor. The table below groups the major providers by the image and video models they offer and their primary access form (open-weight download, hosted API, or consumer app). It is a snapshot for orientation, not an exhaustive catalog; the dated details are in the two timelines that follow.

| Vendor | Image model(s) | Video model(s) | Primary access |
|---|---|---|---|
| OpenAI | DALL·E, DALL·E 2, DALL·E 3, GPT-4o image (gpt-image-1) | Sora, Sora 2 | API + ChatGPT / Sora app |
| Google | Imagen 2 / 3 / 4, Gemini 2.5 Flash Image | Veo, Veo 2 / 3 / 3.1 | Gemini app + Vertex AI API |
| Stability AI | Stable Diffusion, SDXL, Stable Diffusion 3.5 | Stable Video Diffusion | Open-weight + API |
| Black Forest Labs | FLUX.1, FLUX.2 | - | Open-weight ([dev]/[schnell]) + API ([pro]) |
| Midjourney | Midjourney V1–V7 | Video Model V1 | Web app (and Discord) |
| Runway | (image via Gen-3/Gen-4) | Gen-1, Gen-2, Gen-3 Alpha, Gen-4 | Web app + API |
| Luma AI | Photon | Dream Machine, Ray2, Ray3 | Web app + API |
| Adobe | Firefly Image (1–4) | Firefly Video Model | Creative Cloud apps + Firefly API |
| Amazon | Titan Image Generator, Nova Canvas | Nova Reel | Amazon Bedrock API |
| Meta | Emu (research) | Make-A-Video, Movie Gen (research) | Research (tech embedded in Meta AI features) |
| Kuaishou | - | Kling | Web app + API |
| MiniMax | - | Hailuo | App + API |
| Tencent | - | HunyuanVideo | Open-weight |
| Alibaba | Qwen-Image | Wan | Open-weight |
| ByteDance | Seedream | Seedance | App + API |

One useful way to read the field is along two axes: modality (image vs. video) and access model (open-weight vs. closed API/app). Open-weight models can be downloaded and run or fine-tuned locally, while closed models are reached only through a hosted API or app. The table below places representative models on those two axes.
| Access Model | Image Generation | Video Generation |
|---|---|---|
| Open-weight (download, run / fine-tune locally) | Stable Diffusion / SDXL; Stable Diffusion 3.5; FLUX.1 / FLUX.2 [dev]; Qwen-Image (Alibaba) | Stable Video Diffusion; HunyuanVideo (Tencent); Wan (Alibaba) |
| Closed (hosted API or app only) | DALL·E / GPT-4o image; Imagen / Gemini Flash Image; Firefly (Adobe); Nova Canvas (Amazon); Midjourney | Sora / Sora 2 (OpenAI); Veo / Veo 2 / Veo 3; Runway Gen-N / Luma Ray; Kling / Hailuo; Nova Reel (Amazon) |

Note that these boundaries are not rigid: several vendors offer both open-weight and API tiers (Stability AI and Black Forest Labs), and some closed vendors also publish smaller open components. The timelines that follow record the first public announcement of each model regardless of access form.
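
For readers who want to work with this two-axis view programmatically, here is a minimal Python sketch. It is an illustration only: the `Model` record, the `Modality` and `Access` enums, and the `group_by_axes` helper are hypothetical names introduced for this example, and the entries are a small sample taken from the table above, not a complete catalog.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    IMAGE = "image"
    VIDEO = "video"


class Access(Enum):
    OPEN_WEIGHT = "open-weight"
    CLOSED = "closed"


@dataclass(frozen=True)
class Model:
    """One model placed on the two axes (hypothetical record for illustration)."""
    name: str
    modality: Modality
    access: Access


# A small sample of rows from the access/modality table, not the full list.
MODELS = [
    Model("Stable Diffusion 3.5", Modality.IMAGE, Access.OPEN_WEIGHT),
    Model("Qwen-Image", Modality.IMAGE, Access.OPEN_WEIGHT),
    Model("Wan", Modality.VIDEO, Access.OPEN_WEIGHT),
    Model("HunyuanVideo", Modality.VIDEO, Access.OPEN_WEIGHT),
    Model("Midjourney", Modality.IMAGE, Access.CLOSED),
    Model("Nova Canvas", Modality.IMAGE, Access.CLOSED),
    Model("Sora 2", Modality.VIDEO, Access.CLOSED),
    Model("Nova Reel", Modality.VIDEO, Access.CLOSED),
]


def group_by_axes(models):
    """Return {(access, modality): [model names]}, preserving input order."""
    grid = defaultdict(list)
    for m in models:
        grid[(m.access, m.modality)].append(m.name)
    return dict(grid)


if __name__ == "__main__":
    for (access, modality), names in group_by_axes(MODELS).items():
        print(f"{access.value:12} {modality.value:6} {', '.join(names)}")
```

A single `access` value per model is a simplification: as noted above, vendors such as Stability AI and Black Forest Labs offer both open-weight and API tiers, so a fuller version would probably record access per tier rather than per model line.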

Timeline of Image Generation Models

Here is the chronological timeline of major text-to-image (and image-editing) model releases across vendors. Each row links to the official announcement or the vendor's official page. Where a model was a research announcement rather than a released product, or where the announcement and general-availability dates differ, that is noted in the row.

Date Event
2021-01-05 OpenAI DALL·E is introduced - one of the first systems to generate images from natural-language captions, presented as a research model based on a GPT-style architecture. The original model was not released as a public product. Source: DALL·E: Creating images from text.
2022-04-06 OpenAI DALL·E 2 is announced as a research preview with higher-resolution, more photorealistic text-to-image generation. It opened to the public without a waitlist on September 28, 2022 and gained an API later that year. Source: DALL·E 2.
2022-05-23 Google Imagen is published as a research model - a text-to-image diffusion model presented in a paper and project page. It was not released as a product under this name (the later commercial "Imagen" line on Vertex AI is a separate lineage). Source: Imagen (research).
2022-06-22 Google Parti is published as a research model - a text-to-image model using an autoregressive (rather than diffusion) approach, released as a research project page. Source: Parti (research).
2022-07-12 Midjourney enters open beta - a text-to-image service first accessed through a Discord bot, later moving to a dedicated web app; subsequent versions ran through V6 and V7. Source: Midjourney Updates.
2022-08-22 Stable Diffusion is released publicly with open weights - a latent text-to-image diffusion model from the CompVis group together with Stability AI, Runway, and others, under the CreativeML OpenRAIL-M license. This open-weight release is widely regarded as the turning point that put capable image generation on consumer hardware. Source: Stable Diffusion Public Release.
2022-11-24 Stable Diffusion 2.0 is released as open weights, with a new text encoder and depth-guided, inpainting, and upscaler variants. Source: Stable Diffusion 2.0 Release.
2023-03-21 Adobe Firefly is announced as a family of generative models for creators, beginning with a text-to-image model in a web beta and later integrated across Creative Cloud (the Photoshop Generative Fill feature followed in May 2023). Source: Adobe unveils Firefly.
2023-07-26 Stability AI releases SDXL 1.0 (Stable Diffusion XL) - a larger open-weight text-to-image model with a base-plus-refiner design, under the CreativeML Open RAIL++-M license. Source: SDXL 1.0 announcement.
2023-09-20 OpenAI DALL·E 3 is announced, natively integrated with ChatGPT (available to ChatGPT Plus and Enterprise from October 2023 and via the API from November 2023). Source: DALL·E 3.
2023-11-28 Amazon Titan Image Generator is announced in preview at AWS re:Invent - a text-to-image foundation model in Amazon Bedrock with invisible watermarking - and reached general availability on April 23, 2024. Source: Amazon Titan Image Generator (preview).
2023-12-14 Google Imagen 2 becomes generally available on Vertex AI - the commercial Imagen text-to-image line (distinct from the 2022 Imagen research paper), later extended to consumer surfaces. Source: Imagen 2 on Vertex AI is now generally available.
2024-05-14 Google Imagen 3 is announced at Google I/O, rolling out through ImageFX, the Gemini app, and Vertex AI over 2024. Source: Generative AI models Veo and Imagen 3.
2024-08-01 Black Forest Labs launches FLUX.1 in three tiers - FLUX.1 [pro] (API), FLUX.1 [dev] (open weights, non-commercial license), and FLUX.1 [schnell] (open weights, Apache-2.0). Source: Announcing Black Forest Labs.
2024-10-22 Stability AI releases Stable Diffusion 3.5 (Large and Large Turbo; Medium on October 29, 2024) as open-weight models under the Stability AI Community License. Source: Introducing Stable Diffusion 3.5.
2024-10-30 Recraft releases Recraft V3 - a text-to-image model that can also produce vector (SVG) output, available via app and API. Source: Recraft V3.
2024-12-03 Amazon Nova Canvas is announced at AWS re:Invent as the image-generation model in the Amazon Nova family on Amazon Bedrock, with built-in watermarking and content moderation. Source: Announcing Amazon Nova foundation models.
2025-03-25 OpenAI releases GPT-4o native image generation in ChatGPT (exposed in the API as gpt-image-1 on April 23, 2025), generating images inside a multimodal model rather than a separate diffusion pipeline and replacing DALL·E 3 as the default image generator. Source: Introducing 4o image generation.
2025-04-24 Adobe announces the Firefly Image Model 4 and Image Model 4 Ultra at Adobe MAX London, alongside general availability of the Firefly Video Model. Source: Adobe advances creativity with Firefly.
2025-05-20 Google announces Imagen 4 at Google I/O (with Fast and Ultra variants); the family reached general availability over the following months. Source: Announcing Veo 3, Imagen 4, and Lyria 2 on Vertex AI.
2025-08-08 Alibaba releases Qwen-Image as an open-weight (Apache-2.0) text-to-image and image-editing model. Source: Introducing Qwen-Image.
2025-08-26 Google announces Gemini 2.5 Flash Image (nicknamed "Nano Banana") - native in-model image generation and editing across the Gemini API, Google AI Studio, and Vertex AI. Source: Introducing Gemini 2.5 Flash Image.
2025-09-09 ByteDance releases Seedream 4.0 - a unified text-to-image generation and editing model available through its app and API surfaces. Source: Seedream 4.0 officially released.
2025-11-20 Google announces "Nano Banana Pro" (Gemini 3 Pro Image) - the next-generation native image generation and editing model, offered in preview across the Gemini app and Google Cloud. Source: Nano Banana Pro.
2025-11-25 Black Forest Labs launches FLUX.2 in [pro] and [flex] API tiers and an open-weight FLUX.2 [dev], for text-to-image plus multi-reference editing. Source: FLUX.2.

There may be slight variations in the dates in this timeline due to differences between the original announcement, a public release or general availability, and availability inside a specific app or cloud. Where these differ, the dates above prioritize the original vendor announcement.

The content posted here is limited to major releases considered essential for understanding the evolution of image generation models across vendors.
In other words, this timeline does not list every image-model update; it collects representative releases that I have picked out.

Timeline of Video Generation Models

Here is the chronological timeline of major text-to-video and image-to-video model releases across vendors. As with the image timeline, each row links to the official announcement or the vendor's official page, and research announcements are distinguished from released products, with announcement and public-availability dates separated where they differ.

Date Event
2022-09-29 Meta publishes Make-A-Video as a research system - text-to-video generation announced as research, not released as a consumer product or API. Source: Make-A-Video (Meta AI Research).
2022-10 Google publishes Imagen Video as a research model - a cascaded text-to-video diffusion model released as a research project page rather than a product. Source: Imagen Video (research).
2023-02 Runway introduces Gen-1 - a video-to-video model that restyles existing footage, first offered as a research preview. Source: Runway Gen-1.
2023-03-20 Runway announces Gen-2 - text-to-video and image-to-video generation, opened via waitlist and generally available in the app from June 6, 2023. Source: Runway Gen-2.
2023-11-21 Stability AI releases Stable Video Diffusion - an open-weight image-to-video model (SVD and SVD-XT), released as a research preview. Source: Stable Video Diffusion.
2023-11-28 Pika 1.0 is launched - a text- and image-to-video product with a new web experience, from Pika (formerly Pika Labs). Source: Pika.
2024-02-15 OpenAI announces Sora as a research preview - a text-to-video (and image-to-video) model. Sora launched publicly as "Sora Turbo" on December 9, 2024 at sora.com and in ChatGPT. Source: Sora is here.
2024-05-14 Google announces Veo at Google I/O - a text-to-video model offered first as a private preview. Source: Generative AI models Veo and Imagen 3.
2024-06-06 Kuaishou releases Kling - a text-to-video model offered through a web app and later an API; subsequent versions (1.5, 1.6, 2.0, and 2.1) followed. Source: Kling AI.
2024-06-12 Luma AI launches Dream Machine - a text- and image-to-video model available through a web app (its underlying model line is later branded Ray). Source: Luma Dream Machine.
2024-06-17 Runway introduces Gen-3 Alpha - a higher-fidelity text-to-video, image-to-video, and text-to-image model; a faster "Turbo" variant and an API followed later in 2024. Source: Introducing Gen-3 Alpha.
2024-08-31 MiniMax releases Hailuo (video-01) - a text- and image-to-video model available via the MiniMax platform and app. Source: MiniMax video generation.
2024-10-04 Meta publishes Movie Gen as a research suite - foundation models for video generation, personalized video, video editing, and audio, announced as research rather than a shipped product. Source: Movie Gen (Meta AI Research).
2024-10-14 Adobe launches the Firefly Video Model in a limited public beta (previewed September 2024), reaching general availability on April 24, 2025. Source: Adobe launches Firefly Video Model.
2024-12-03 Amazon Nova Reel is announced at AWS re:Invent as the video-generation model in the Amazon Nova family on Amazon Bedrock; Nova Reel 1.1 (April 7, 2025) extended it to multi-shot videos up to two minutes. Source: Announcing Amazon Nova foundation models.
2024-12-03 Tencent releases HunyuanVideo - an open-weight text-to-video model published on GitHub under the Tencent Hunyuan Community License. Source: HunyuanVideo.
2024-12-16 Google announces Veo 2 - text- and image-to-video generation up to 4K, rolling out through Vertex AI and the Gemini app. Source: Veo 2 and Imagen 3 update.
2025-01-15 Luma AI releases Ray2 - a text- and image-to-video model, in the web app first and later the API. Source: Introducing Ray2.
2025-02-25 Alibaba releases Wan 2.1 - an open-weight (Apache-2.0) text- and image-to-video model family. Source: Wan 2.1.
2025-03-31 Runway announces Gen-4 - image- and text-to-video generation with improved reference-based consistency, in the app and API. Source: Introducing Runway Gen-4.
2025-05-20 Google announces Veo 3 at Google I/O - text- and image-to-video with native synchronized audio, reaching wider availability over the following months. Source: Gemini app updates at I/O 2025.
2025-06-11 ByteDance releases Seedance 1.0 - a text- and image-to-video model with multi-shot generation, on its Volcano Engine and consumer surfaces. Source: Seedance.
2025-06-18 Midjourney releases its V1 Video Model - an image-to-video ("animate") model in the Midjourney web app. Source: Introducing our V1 Video Model.
2025-09-18 Luma AI releases Ray3 - a text- and image-to-video model described as a reasoning model with native HDR output. Source: Ray3.
2025-09-30 OpenAI announces Sora 2 - a video-and-audio generation model, alongside a new Sora app. Source: Sora 2.
2025-10-15 Google announces Veo 3.1 - text- and image-to-video with richer native audio and additional creative controls in the Gemini API. Source: Introducing Veo 3.1.

As with the image timeline, dates may vary slightly between the original announcement, a public or general-availability release, and rollout inside a particular app or region; the dates above prioritize the original vendor announcement.
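
The dating rule used in both timelines (order rows by the original announcement and note availability separately) can be sketched in a few lines of Python. This is a hedged illustration, not the tooling behind this page: the `Release` record and the `timeline_order` function are hypothetical names, and the sample rows are copied from the video timeline above. It relies on the fact that ISO 8601 date strings sort chronologically as plain text, which also covers the month-only entries such as 2022-10.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    """One timeline row (hypothetical record for illustration)."""
    model: str
    announced: str                   # ISO 8601; "YYYY-MM" when only the month is known
    available: Optional[str] = None  # public or general availability, if it differs


def timeline_order(releases):
    """Sort by announcement date; ISO 8601 strings compare chronologically as text."""
    return sorted(releases, key=lambda r: r.announced)


# Sample rows taken from the video timeline, deliberately out of order.
ROWS = [
    Release("Sora", "2024-02-15", available="2024-12-09"),
    Release("Runway Gen-2", "2023-03-20", available="2023-06-06"),
    Release("Imagen Video", "2022-10"),
    Release("Make-A-Video", "2022-09-29"),
    Release("Runway Gen-1", "2023-02"),
]

for row in timeline_order(ROWS):
    note = f" (available {row.available})" if row.available else ""
    print(f"{row.announced:<10} {row.model}{note}")
```

One consequence of comparing strings is that a month-only date sorts before every full date in the same month, so a row dated 2022-10 would precede one dated 2022-10-01. That matches how the month-only rows are placed here, but it is a convention rather than a fact about the releases.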

The content posted here is limited to major releases considered essential for understanding the evolution of video generation models across vendors, and does not attempt to list every model version.

Capability and Access Evolution

Beyond individual model dates, the field advanced through a handful of milestones that first appeared in one model and then spread across the ecosystem. The table below tracks when each capability or access model first became publicly available, and in which model or surface it first appeared.

| Capability / access milestone | First Introduced | First Model / Surface | Notes |
|---|---|---|---|
| Text-to-image (broad research system) | 2021-01-05 | DALL·E | One of the first systems to generate images from free-form text. |
| Open-weight text-to-image | 2022-08-22 | Stable Diffusion | Weights released publicly under CreativeML OpenRAIL-M; enabled local use and fine-tuning. |
| Text-to-video (research) | 2022-09-29 | Make-A-Video / Imagen Video | First wave of text-to-video, announced as research rather than products. |
| Image generation inside a chat product | 2023-10 | DALL·E 3 in ChatGPT | Image generation integrated directly into a conversational assistant. |
| Commercial text-to-video product | 2023 | Runway Gen-2 / Pika | Text- and image-to-video offered as paid apps to the public. |
| Open-weight video generation | 2023-11-21 | Stable Video Diffusion | Later joined by open-weight HunyuanVideo (2024-12) and Wan 2.1 (2025-02). |
| Text-to-video at product scale | 2024-12-09 | Sora (public) | Announced as a research preview in February 2024; opened to the public in December 2024. |
| Native image generation inside an LLM | 2025-03-25 | GPT-4o image | Image generation within a multimodal model; Gemini 2.5 Flash Image followed (2025-08). |
| Instruction-based in-context image editing | 2025 | FLUX.1 Kontext / Nano Banana / Seedream | Editing an existing image from a text instruction inside the generation model. |
| Synchronized native audio in generated video | 2025-05-20 | Veo 3 | Video generated together with matching audio; Sora 2 followed (2025-09-30). |

Two structural themes run through this table. First, access has repeatedly forked between open-weight models that anyone can download and closed models reached only through an API or app, with several vendors (Stability AI, Black Forest Labs, Alibaba, Tencent) offering open weights and others (OpenAI, Google, Midjourney, Runway) staying closed. Second, image generation has been migrating into general multimodal models: what began as dedicated diffusion systems (DALL·E, Imagen, Stable Diffusion) is increasingly a capability of a broader LLM (GPT-4o, Gemini), even as specialized open-weight diffusion and transformer models continue to advance in parallel.

Where Each Vendor Stands

The following is a neutral, vendor-by-vendor summary of each major provider's image and video generation models, based on their official announcements. It describes what each vendor offers and how it is accessed, without comparing output quality or ranking the models.

OpenAI

OpenAI's image line runs from DALL·E (2021) and DALL·E 2/3 to GPT-4o native image generation (2025, exposed in the API as gpt-image-1), which replaced DALL·E 3 as the default image generator in ChatGPT. For video, Sora was announced as a research preview in February 2024 and launched publicly in December 2024, and Sora 2 (September 2025) added synchronized audio and a dedicated Sora app. These models are reached through ChatGPT, the Sora app, and the OpenAI API. OpenAI's language-model line is covered separately in my OpenAI GPT Model Release Timeline.

Google

Google offers two commercial families: Imagen (Imagen 2 in December 2023, then Imagen 3 and Imagen 4) for images and Veo (Veo, Veo 2, Veo 3, and Veo 3.1) for video, both available through the Gemini app and the Vertex AI API. In 2025 Google also shipped native in-model image generation as Gemini 2.5 Flash Image ("Nano Banana") and its successor "Nano Banana Pro." Google separately published influential research models - Imagen and Parti (image) and Imagen Video (video) - that were never released as products, a distinction worth keeping in mind because the research "Imagen" and the commercial "Imagen" line share a name but are different lineages.

Stability AI

Stability AI anchors the open-weight image ecosystem: the original Stable Diffusion (August 2022) was the open-weight turning point, followed by SDXL (2023) and Stable Diffusion 3.5 (2024), each under its own license (CreativeML OpenRAIL-M, CreativeML Open RAIL++-M, and the Stability AI Community License, respectively). For video, Stable Video Diffusion (November 2023) brought an open-weight image-to-video model. Stability offers both open-weight downloads and a hosted API.

Black Forest Labs

Black Forest Labs, founded by researchers behind the original Stable Diffusion, launched the FLUX.1 family in August 2024 in three tiers - [pro] via API, and [dev] and [schnell] as open weights - and FLUX.2 in November 2025. The FLUX line is text-to-image with in-context editing (FLUX.1 Kontext), and combines an open-weight strategy for the smaller tiers with an API for the flagship.

Midjourney

Midjourney is a subscription image-generation service that grew from a Discord bot (open beta, July 2022) into a web app, iterating through versions up to V7 (2025). In June 2025 it added its first Video Model (V1), an image-to-video "animate" capability. Access is through the Midjourney web app and Discord.

Runway

Runway specializes in video and creative tooling. Gen-1 (video-to-video, 2023) and Gen-2 (text- and image-to-video, 2023) were among the first commercial generative video products, followed by Gen-3 Alpha (2024) and Gen-4 (2025), with an API for developers. Runway was also a co-creator of the original Stable Diffusion.

Luma AI

Luma AI's Dream Machine (June 2024) launched its video line, whose model is branded Ray; Ray2 (January 2025) and Ray3 (September 2025) followed, and Luma also offers the Photon image model. Access is via web app and API.

Adobe

Adobe's Firefly family (image beta March 2023, through Firefly Image Model 4 in 2025) is designed for integration across Creative Cloud, with features such as Photoshop Generative Fill. The Firefly Video Model (beta October 2024, general availability April 2025) added text- and image-to-video. Adobe positions Firefly around commercially safe training data and offers both app integrations and a Firefly Services API.

Amazon

Amazon offers image and video generation through Amazon Bedrock: the Titan Image Generator (preview 2023, GA 2024) and, in the Amazon Nova family, Nova Canvas (image) and Nova Reel (video), both announced at re:Invent 2024. These are delivered as APIs within Amazon Bedrock; the full Nova family is covered in my Amazon Nova Model Release Timeline, and the broader platform history in the AWS Generative AI History and Timeline.

Meta

Meta's contributions in this space have been primarily research: Make-A-Video (2022), the Emu image models (2023), and the Movie Gen suite (2024) were published as research rather than shipped as standalone products, though related technology appears in consumer features such as Meta AI's image tools.

Chinese labs (Kuaishou, MiniMax, Tencent, Alibaba, ByteDance)

Several Chinese labs became major video and image players from 2024 onward. Kuaishou's Kling (June 2024) and MiniMax's Hailuo (August 2024) are app- and API-based video models with rapid version cadences. Tencent released HunyuanVideo (December 2024) and Alibaba released Wan (2.1 in February 2025) as open-weight video models, while Alibaba's Qwen-Image (August 2025) is an open-weight image model. ByteDance's Seedream (image) and Seedance (video) round out the group and are offered through app and API surfaces.

Specialist image models (Ideogram, Recraft)

Beyond the largest providers, a few specialists focus on particular image needs. Ideogram is known for text rendering within images, and Recraft (Recraft V3, October 2024) can generate vector (SVG) output in addition to raster images. Both are offered as apps and APIs; see the Ideogram features page and Recraft V3 announcement for details.

Frequently Asked Questions

When was Stable Diffusion released, and why is it considered a turning point?

Stable Diffusion was released publicly on August 22, 2022 with open weights under the CreativeML OpenRAIL-M license, developed by the CompVis group together with Stability AI, Runway, and others. It is widely seen as a turning point because it was the first capable text-to-image model whose weights were openly available, so it could run on consumer GPUs and be freely fine-tuned - which seeded a large open ecosystem (SDXL, Stable Diffusion 3.5, and later FLUX and other open-weight models). See the official Stable Diffusion Public Release.

When did DALL·E launch?

OpenAI introduced the original DALL·E on January 5, 2021 as a research model. DALL·E 2 followed as a research preview on April 6, 2022 and opened to the public in September 2022, and DALL·E 3 was announced on September 20, 2023. In March 2025, GPT-4o native image generation replaced DALL·E 3 as the default image generator in ChatGPT.

When did Sora come out - the announcement or the public release?

Both dates matter and are often confused. OpenAI announced Sora as a research preview on February 15, 2024, but it was not publicly usable then; it launched publicly as "Sora Turbo" on December 9, 2024 at sora.com and in ChatGPT. The next-generation Sora 2 (with audio) was announced on September 30, 2025. See Sora is here and Sora 2.

What are the major open-weight image and video generation models?

For images, the main open-weight lines are Stability AI's Stable Diffusion, SDXL, and Stable Diffusion 3.5, Black Forest Labs' FLUX.1 ([dev] and [schnell] tiers) and FLUX.2 [dev], and Alibaba's Qwen-Image. For video, open-weight options include Stable Video Diffusion, Tencent's HunyuanVideo, and Alibaba's Wan. Licenses vary - some are permissive (for example Apache-2.0 for FLUX.1 [schnell], Qwen-Image, and Wan 2.1), while others are non-commercial or community licenses - so check each model's official license before use.

When did image generation move inside large language models?

Image generation began as dedicated diffusion systems (DALL·E, Imagen, Stable Diffusion). The shift to generating images inside a multimodal LLM became mainstream with OpenAI's GPT-4o native image generation (March 25, 2025) and Google's Gemini 2.5 Flash Image (August 26, 2025). Specialized open-weight and API diffusion/transformer models continue to advance alongside this trend.

Which timeline covers Amazon Nova Canvas and Nova Reel in detail?

This cross-vendor page lists Amazon's Nova Canvas (image) and Nova Reel (video) among the other providers. For a dedicated, Bedrock-focused history of the whole Amazon Nova family - including regions, cross-Region inference, and integration with Amazon Bedrock features - see my Amazon Nova Model Release Timeline.

Does this timeline cover audio, music, speech, or 3D generation?

No. This page is deliberately limited to image and video generation. Audio, music, and speech generation (for example speech-to-speech and music models) and 3D generation are out of scope and may be covered in a separate future article. Pricing and benchmark scores are also intentionally omitted, in keeping with the neutral, dated focus of this timeline.

Summary

In this article, I built a cross-vendor release timeline for image and video generation models, from OpenAI's original DALL·E (2021) and the open-weight release of Stable Diffusion (2022) through the 2025 generation of native in-model image generation and audio-synchronized video, organized as a landscape map, a chronological image timeline, a chronological video timeline, the evolution of capabilities and access, and a vendor-by-vendor summary.

Two threads stand out. First, access keeps forking between open-weight models that anyone can download and closed models reached only through an API or app, with strong ecosystems on both sides. Second, generation is converging into general multimodal models - image generation is increasingly a capability of an LLM (GPT-4o, Gemini), and video is moving toward audio-synchronized output (Veo 3, Sora 2) - even as specialized image and video models keep advancing in parallel. Throughout, I have kept the record neutral and dated: each row links to an official source, research announcements are distinguished from released products, and no pricing or benchmark comparisons are included.

I will continue to monitor how image and video generation models evolve across vendors. The most reliable places to track new announcements are each vendor's official news pages linked throughout this article.

Related model-timeline and history articles are also available on hidekazu-konishi.com; please have a look if you are interested.

Anthropic Claude Model Release Timeline - Model Family Tree, Capability Evolution, and Platform Availability
OpenAI GPT Model Release Timeline - Model Lineage, ChatGPT and Codex Milestones, and Platform Availability
Amazon Nova Model Release Timeline - Model Family, Capability Evolution, and Availability on Amazon Bedrock
AWS Generative AI History and Timeline - From SageMaker JumpStart to Bedrock AgentCore


Written by Hidekazu Konishi