Will a transformer based model be SOTA for video generation by the end of 2025?
82% chance

Resolves YES or NO if there's a widely agreed-upon benchmark or the answer is obvious in popularity.

Resolves N/A if there's no clear consensus or benchmark.


Resolution: YES

Reasoning (as of 1 Jan 2026):

By the end of 2025, the clear, widely recognized state-of-the-art video generators (most notably OpenAI’s Sora, along with competitors like HunyuanVideo) use transformer-based architectures, typically diffusion models with Transformer backbones (e.g. DiT-style).
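For concreteness, a minimal sketch of what “DiT-style” means: a transformer block operating on a sequence of (spacetime) patch tokens, conditioned on the diffusion timestep via adaptive layer norm (adaLN), as in the DiT paper (Peebles & Xie, 2023). All names, shapes, and dimensions below are illustrative assumptions; Sora’s actual architecture is unpublished.

```python
# Illustrative DiT-style transformer block -- a sketch of the paradigm,
# NOT any specific model's (e.g. Sora's) unpublished implementation.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One transformer block conditioned on the diffusion timestep
    via adaptive layer norm (adaLN): the timestep embedding produces
    per-block scale/shift/gate parameters."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Map timestep embedding to 6 modulation tensors (2 per sublayer + gates).
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patch_tokens, dim); t_emb: (batch, dim)
        s1, sc1, g1, s2, sc2, g2 = self.adaLN(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x

# Smoke test on random spacetime patch tokens (hypothetical shapes):
tokens = torch.randn(2, 256, 512)   # 2 videos, 256 patch tokens each
t_emb = torch.randn(2, 512)         # diffusion timestep embeddings
print(DiTBlock(512, 8)(tokens, t_emb).shape)  # torch.Size([2, 256, 512])
```

The point of the sketch is only that the denoising backbone is a stack of such transformer blocks over patch tokens, rather than the U-Net convolutions used by earlier diffusion video models.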

While there is no single universally agreed-upon benchmark for video generation quality, the outcome is obvious in popularity, capability demonstrations, and expert consensus:

- Sora was broadly regarded throughout 2024–2025 as the SOTA reference for general text-to-video.

- Its architecture is transformer-based, and competing SOTA-level systems followed the same paradigm.

- No competing non-transformer paradigm displaced transformers at the top by EOY 2025.

Under your rules:

✔ Obvious in popularity / consensus → resolve YES

✘ Not ambiguous enough to require N/A

Final answer: YES.

@Areal The model people mean when they talk about “Sora 2” as the SOTA video generator is OpenAI’s Sora 2 video-and-audio generation model.

Key facts:

Sora 2 is the second-generation video generation model released by OpenAI, succeeding the original Sora model first shown in early 2024.

It generates realistic short videos (with synchronized audio) from text prompts and/or reference images/video.

OpenAI itself describes Sora 2 as its flagship video and audio generation model, with improved realism, more accurate physics, and better controllability than prior video AI systems.

So yes: “Sora 2” refers to the leading model in this space, widely regarded as the top or near-top video generator as of the end of 2025.
