Technical Blog Posts

Authoritative technical announcements and deep dives from the laboratories advancing multimodal AI — curated for scholarly depth and engineering insight.

Text-to-Image Generation (20 posts)

Black Forest Labs

Announcing FLUX.2 — Frontier Visual Intelligence

The most capable model from BFL to date — featuring image editing at 4MP, complex text rendering, multi-reference consistency, and enhanced spatial reasoning across four model variants.

Black Forest Labs

FLUX.2 [klein] — Unified Generation & Editing Under One Second

BFL's fastest model offering unified generation and editing in sub-second inference, runnable on consumer hardware. The 4B-parameter variant is released under Apache 2.0.

Black Forest Labs

FLUX.2 VAE — Analyzing and Enhancing the Latent Space

A comprehensive research report comparing autoencoder architectures (SD-VAE, RAE, FLUX.1 VAE, FLUX.2 VAE) and their impact on latent space representation quality for image synthesis.

Black Forest Labs

FLUX.1 Kontext — Flow Matching for Unified Generation & Editing

Documenting BFL's flow matching approach for unified image generation and editing in latent space — the architectural foundation that preceded FLUX.2.

Stability AI

Stable Diffusion 3 — Research Paper

The technical paper behind SD3's Multimodal Diffusion Transformer (MMDiT) architecture — outperforming DALL·E 3, Midjourney v6, and Ideogram v1 on typography and prompt adherence in human preference evaluations.

Stability AI

Introducing Stable Diffusion 3.5

Three model variants (Large 8.1B, Large Turbo, Medium 2.5B) with improved MMDiT dual attention layers and QK normalization — free for commercial use, running on consumer hardware.

ByteDance Seed

Seedream 3.0 — Text-to-Image Technical Report

ByteDance's scaled-up diffusion transformer achieving top arena rankings through massive compute and architectural innovations — surpassing prior open-source models on multiple benchmarks.

ByteDance Seed

Seedream 4.0 — Beyond Drawing, Into Imagination

A unified architecture for T2I generation and editing supporting 4K output, 30+ artistic styles, 6 image references, and 10x faster reasoning — with advanced logical understanding capabilities.

Hugging Face

Diffusers Welcomes Stable Diffusion 3.5 Large

Technical walkthrough of integrating SD 3.5 into the Diffusers library — covering the improved MMDiT architecture, QK normalization, LoRA adapters, and optimization techniques.
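
For orientation, here is a minimal sketch of what loading SD 3.5 Large through Diffusers looks like, assuming the stabilityai/stable-diffusion-3.5-large Hub checkpoint and a recent diffusers release; the post itself covers quantization, LoRA adapters, and memory optimizations in more depth.

```python
# Minimal text-to-image sketch with SD 3.5 Large via Diffusers
# (assumes a CUDA GPU; step count and guidance scale are illustrative).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # trades speed for lower VRAM on consumer cards

image = pipe(
    prompt="a macro photo of a dew-covered leaf, shallow depth of field",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("sd35_sample.png")
```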

Photoroom & Hugging Face

PRX — Open-Sourcing a 1.3B Text-to-Image Model

A 1.3B-parameter T2I model trained in under 10 days on 32 H200 GPUs — featuring REPA, distillation, SFT, and DPO. Released under Apache 2.0 with full training transparency.

Hugging Face

Diffusers Welcomes FLUX.2 — Architecture Deep Dive

Technical walkthrough of FLUX.2's new MM-DiT architecture — single Mistral Small 3.1 encoder, SwiGLU activations, 8+48 block design, bias-free layers, and shared modulation parameters.

Hugging Face

FLUX.1 QLoRA — Fine-Tuning a 12B T2I Model on Consumer Hardware

Efficient fine-tuning of FLUX.1-dev using QLoRA with under 10 GB VRAM on an RTX 4090 — practical guide covering transformer-only training, frozen encoders, and memory optimization techniques.
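
The core of the recipe is small enough to sketch: quantize the 12B transformer to 4-bit NF4, attach LoRA adapters to its attention projections, and leave the text encoders frozen. The module names and hyperparameters below are illustrative assumptions, not the post's exact configuration.

```python
# QLoRA setup sketch for FLUX.1-dev: 4-bit NF4 transformer + LoRA adapters,
# with everything except the adapters frozen. Illustrative values only.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
from peft import LoraConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
transformer.requires_grad_(False)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
transformer.add_adapter(lora_config)  # only the LoRA weights receive gradients

trainable = sum(p.numel() for p in transformer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
```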

Mistral AI

Pixtral 12B & Pixtral Large — Frontier Open Multimodal Models

Mistral's vision-language models: 12B (Apache 2.0) with 400M vision encoder excelling at chart/document QA, and 124B outperforming GPT-4o on MathVista — both handling variable image sizes and 128K context.

Jay Alammar

The Illustrated Stable Diffusion

The definitive visual explainer of latent diffusion — breaking down CLIP text encoding, the UNet denoiser, latent space compression, and the iterative refinement process with beautifully crafted diagrams.
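
The whole pipeline the post illustrates compresses into a short loop: encode the prompt, start from noise in latent space, repeatedly denoise with the UNet, then decode with the VAE. The callables below are placeholders for those components, and classifier-free guidance is omitted for brevity.

```python
# Schematic latent diffusion sampling loop; `text_encoder`, `unet`, `scheduler`,
# and `vae` stand in for the real CLIP text encoder, UNet denoiser, noise
# scheduler, and VAE decoder described in the post.
import torch

def generate(prompt, text_encoder, unet, scheduler, vae, steps=50):
    cond = text_encoder(prompt)                  # 1. text -> embeddings
    latents = torch.randn(1, 4, 64, 64)          # 2. pure noise in the compressed latent space
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:                # 3. iterative refinement, conditioned on text
        noise_pred = unet(latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return vae.decode(latents)                   # 4. latents -> full-resolution image
```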

Towards Data Science

A Visual Guide to How Diffusion Models Work

Minimal-math visual guide to diffusion models — explains what they learn, why they work, and how to extract outputs using a "glyffuser" trained on Chinese glyphs as a didactic example.

Simon Willison

GPT-4o Image Generation & the gpt-image-1 API

Hands-on testing of OpenAI's GPT-4o native image generation — benchmarking prompt fidelity, text rendering, public figure policy changes, API pricing analysis, and comparison with FLUX.2-klein.
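
For reference, a minimal call against the endpoint the post tests, assuming the official openai Python SDK and an API key in the environment; the prompt and size are arbitrary examples.

```python
# gpt-image-1 returns base64-encoded image data by default.
import base64
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="gpt-image-1",
    prompt="a hand-lettered sign reading 'OPEN LATE', photographed at dusk",
    size="1024x1024",
)
with open("sign.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```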

Apple ML Research

Matryoshka Diffusion Models — End-to-End High-Resolution Synthesis

Multi-scale joint diffusion enabling end-to-end 1024x1024 image and video synthesis — a nested architecture that denoises at multiple resolutions simultaneously for efficient high-fidelity generation.

OpenAI

DALL·E 3 — Improving Image Generation with Better Captions

OpenAI's breakthrough in caption fidelity — training a SOTA image captioner to produce detailed descriptions, then training the model on these enhanced captions for dramatically improved prompt adherence, text rendering, and fine detail.

Hugging Face & Stability AI

Diffusers Welcomes Stable Diffusion 3 — MMDiT Architecture Deep Dive

Technical walkthrough of SD3's Multimodal Diffusion Transformer (MMDiT) — separate weights for text and image embeddings with bidirectional attention, rectified flow matching training, and the new FlowMatchEulerDiscreteScheduler.
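
In simplified notation (timestep shifting and resolution-dependent details omitted), the rectified-flow relations behind that scheduler are a linear interpolation between data and noise, a velocity-matching loss, and a plain Euler step at sampling time:

```latex
\begin{align*}
  x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
  \mathcal{L} &= \mathbb{E}_{x_0,\,\epsilon,\,t}\!\left[\lVert v_\theta(x_t, t, c) - (\epsilon - x_0)\rVert^2\right] \\
  x_{t_{i+1}} &= x_{t_i} + (t_{i+1} - t_i)\,v_\theta(x_{t_i}, t_i, c)
\end{align*}
```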

Adobe Research

Adobe Firefly Image 3 Model — Advancing Creative Ideation

Adobe's commercially safe image model trained on licensed content — enhanced photorealistic quality, Structure & Style Reference capabilities, and superior complex prompt understanding integrated across Creative Cloud.

Text-to-Video & Image-to-Video Generation (20 posts)

OpenAI

Sora — Creating Video from Text

OpenAI's seminal video generation model announcement — introducing spacetime patches, diffusion transformers for video, and the vision of video as a path toward world simulators.

OpenAI

Sora 2 Is Here

Major upgrade with accurate physics modeling, synchronized dialogue and sound, multi-shot sequences with persistent world state, and two model variants (Sora 2 and Sora 2 Pro).

Google DeepMind

State-of-the-Art Video and Image Generation with Veo 2 and Imagen 3

DeepMind's announcement of Veo 2 for 8-second video generation and Imagen 3 for photorealistic T2I — with SynthID watermarking, physics-aware temporal synthesis, and diverse cinematic styles.

Google DeepMind

Empowering YouTube Creators with Generative AI

How Veo is integrated into YouTube's Dream Screen for video background generation and standalone clips — bridging research models with creator tools at billion-user scale.

Meta AI

Movie Gen — Media Foundation Models for Generative AI Video

A 30B-parameter transformer generating 16s of 1080p HD video with synchronized audio — SOTA on text-to-video, personalization, video editing, and audio generation benchmarks.

Meta AI

Introducing Emu Video and Emu Edit

Factorized text-to-video generation via explicit image conditioning — two diffusion models producing 4-second 512x512 video, preferred over Make-A-Video, Imagen Video, and commercial solutions.

Runway

Introducing Gen-3 Alpha — A New Frontier for Video Generation

Joint video-image training for enhanced visual fidelity — featuring temporally dense captions for precise keyframing, Motion Brush, Advanced Camera Controls, Director Mode, and C2PA provenance.

Alibaba

Wan 2.2 — Open-Source MoE Video Generation

The first open-source large video model with Mixture-of-Experts: 27B total parameters, 14B active per step, Apache 2.0 license. Two specialized experts for high-noise and low-noise stages.

Tencent

HunyuanVideo 1.5 — Efficient 8.3B Video Generation

Compact 8.3B-parameter model with Selective and Sliding Tile Attention (SSTA) achieving 1.87x speedup. 3D Causal VAE with 16x spatial and 4x temporal compression, plus 1080p super-resolution.

Tencent

HunyuanCustom — Multimodal Video Customization

A multimodal video customization framework supporting text, image, audio, and video inputs — enabling singing avatars, virtual try-on, and consistent character generation across scenes.

Luma AI

Dream Machine Ray2 — 10x More Compute, Next-Level Realism

Luma's Ray2 model trained with 10x more compute than previous versions — featuring Photon T2I, character references, camera motion control, and the Inductive Moment Matching (IMM) paradigm.

Luma AI

Ray3.14 — Native 1080p, 4x Faster, 3x Cheaper

Luma's most advanced model with native 1080p video, dramatically improved speed and cost efficiency — featuring Modify Video for director-grade editing and natural language VFX instructions.

Lil'Log (Lilian Weng)

Diffusion Models for Video Generation

A comprehensive tutorial covering diffusion-based video generation from foundations — 3D U-Net vs. DiT architectures, temporal consistency, video training data, and adapting image diffusion models to video.
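
One recurring pattern from the tutorial, factorized spatio-temporal attention, is easy to sketch: attend over space within each frame, then over time at each spatial location, a common way to adapt image diffusion backbones to video. Shapes and layer choices below are illustrative, not taken from any specific model.

```python
# Factorized spatial + temporal attention block (toy sketch).
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, n, d = x.shape

        s = x.reshape(b * f, n, d)                       # spatial attention within each frame
        x = x + self.spatial_attn(s, s, s)[0].reshape(b, f, n, d)

        t = x.permute(0, 2, 1, 3).reshape(b * n, f, d)   # temporal attention at each location
        t = self.temporal_attn(t, t, t)[0]
        return x + t.reshape(b, n, f, d).permute(0, 2, 1, 3)

video_tokens = torch.randn(2, 16, 256, 64)  # 2 clips, 16 frames, 16x16 latent tokens each
out = FactorizedSTBlock(64)(video_tokens)   # same shape out: (2, 16, 256, 64)
```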

Pika

Pika 2.5 — Ultra-Realistic Video Generation

Pikaframes for 5–25 second sequences with keyframe control, advanced creative effects (Pikadditions, Pikaswaps, Pikatwists), and improved natural motion at 1080p resolution.

Stability AI

Introducing Stable Video Diffusion — Open AI Video Model

The first open foundation model for generative video — 3-stage latent video diffusion training, 576x1024 at 14/25 frames, preferred over closed-source models in user studies. Code + weights on Hugging Face.

Towards Data Science

Sora — Intuitively and Exhaustively Explained

A comprehensive visual walkthrough of Sora's architecture — spacetime patch tokenization, latent space compression, DiT denoising process, and emergent 3D consistency, explained from first principles.

OpenAI Research

Video Generation Models as World Simulators

The seminal Sora technical report — introducing spacetime patches, diffusion transformers for video, and the thesis that scaling video generation is a path toward general-purpose physical world simulators.

Towards AI (Jesus Rodriguez)

Inside OpenAI Sora — Five Key Technical Details

Deep analysis of Sora's architecture — visual patch tokenization, latent space compression, variable resolution/duration training, diffusion transformer scaling, and emergent 3D consistency.

Apple ML Research

STIV — Scalable Text and Image Conditioned Video Generation

Apple's 8.7B-parameter video model surpassing CogVideoX-5B, Pika, and Kling — 83.1 on VBench T2V, supporting text-to-video, image-to-video, video prediction, and multi-view generation.

Microsoft Research

CoDi-2 — In-Context Any-to-Any Multimodal Generation

MLLM performing any-to-any generation across text, vision, and audio — enabling in-context learning and few-shot capabilities for editing, composition, and subject-driven generation.

3D Vision (9 posts)

INRIA GraphDeco

3D Gaussian Splatting for Real-Time Radiance Field Rendering

The SIGGRAPH 2023 Best Paper that launched the 3DGS revolution — anisotropic Gaussians with interleaved optimization, achieving 100+ fps at 1080p while matching Mip-NeRF 360 quality in 30 minutes.
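
In the paper's notation, roughly: each primitive is an anisotropic 3D Gaussian with a covariance factored into rotation and scale, and a pixel's colour comes from front-to-back alpha compositing of the projected, depth-sorted splats.

```latex
\begin{align*}
  G(x) &= \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right),
  \qquad \Sigma = R\,S\,S^{\top}\!R^{\top} \\
  C &= \sum_{i=1}^{N} c_i\,\alpha_i \prod_{j=1}^{i-1}\left(1-\alpha_j\right)
\end{align*}
```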

INRIA

Creating Stunning Real-Time 3D Scenes — The Breakthrough of 3DGS

INRIA's accessible overview of how 3D Gaussian Splatting represents a fundamental shift from volumetric to point-based rendering — covering the journey from NeRF to real-time neural scene representations.

NVIDIA Research

3D Gaussian Ray Tracing — Fast Tracing of Particle Scenes

SIGGRAPH Asia 2024 — replacing rasterization with GPU ray tracing hardware for 3DGS. Enables secondary lighting effects (shadows, reflections) and handles distorted cameras common in robotics.

Karthick AI

Gaussian Splatting — A Deep Dive into 3D Data Visualization

An in-depth technical tutorial explaining Gaussian Splatting from first principles — covering 3D Gaussian representation, covariance matrices, depth compositing, and the full rendering pipeline.
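
The depth-compositing step the tutorial covers reduces to a small amount of per-pixel arithmetic: sort the splats covering a pixel from front to back, then accumulate colour until transmittance is used up. The toy NumPy sketch below shows only that math; real renderers do it per tile on the GPU.

```python
# Front-to-back alpha compositing for one pixel (toy illustration).
import numpy as np

def composite_pixel(depths, colors, alphas, min_transmittance=1e-4):
    """depths: (N,), colors: (N, 3), alphas: (N,) for splats covering one pixel."""
    order = np.argsort(depths)           # nearest splat first
    pixel = np.zeros(3)
    transmittance = 1.0
    for i in order:
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
        if transmittance < min_transmittance:
            break                        # stop once the pixel is effectively opaque
    return pixel

print(composite_pixel(
    depths=np.array([2.0, 1.0, 3.0]),
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    alphas=np.array([0.6, 0.5, 0.9]),
))
```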

Naver Labs Europe

DUSt3R — Geometric 3D Vision Made Easy

CVPR 2024 — eliminating camera calibration for 3D reconstruction by directly regressing pointmaps from arbitrary image collections. End-to-end pipeline unifying monocular and stereo reconstruction without camera intrinsics.

Hugging Face

Introduction to 3D Gaussian Splatting

Hugging Face's accessible introduction to 3DGS — covering the fundamentals of Gaussian primitives, differentiable rasterization, optimization pipeline, and integration with the ML ecosystem.

LearnOpenCV

3D Gaussian Splatting — Paper Explanation & Custom Dataset Training

Comprehensive hands-on tutorial — paper explanation, SfM initialization, adaptive density control, differentiable rasterization, plus step-by-step training on custom datasets with Nerfstudio and gsplat.

Radiance Fields Newsletter

Radiance Fields in 2025 — State of the Art

Annual industry review — Hollywood and NVIDIA adopting 3DGS at scale, Apple and Tesla demonstrations, city-scale reconstruction, Brush 0.2, and the convergence of splatting with game engines.

Tencent Hunyuan

Hunyuan3D 2.0 — Scalable 3D Asset Generation with DiT & Paint

Tencent's open-source 3D generation pipeline: a flow-based DiT for geometry with importance sampling, plus a multi-view consistent texture painter — outperforming closed-source models in geometry detail, alignment, and texture quality.

4D Spatial Intelligence (6 posts)

4DGF Team

Dynamic 3D Gaussian Fields for Urban Areas

NeurIPS 2024 Spotlight — extending 3DGS to large-scale dynamic urban scenes via scene graphs with neural appearance fields. 3+ dB PSNR improvement and 100x faster rendering over prior methods.

Instant4D Team

Instant4D — 4D Gaussian Splatting in Minutes

Reconstructing casual monocular videos in ~2 minutes — grid pruning reduces Gaussians by 92% with 30x speedup. No calibrated cameras required for in-the-wild 4D reconstruction.

MPI-INF

Dynamic EventNeRF — Reconstructing Dynamic Scenes from Event Streams

Combining event cameras with RGB frames for 4D reconstruction of fast-moving scenes — a time-conditioned NeRF using Fast Event Accumulation with Decay, working from as few as two training views.

KAIST CVLab

D²USt3R — Feed-Forward 4D Pointmaps for Dynamic Scenes

Extending DUSt3R to dynamic scenes with feed-forward 4D pointmap regression — optical flow alignment and dynamic masks handle occluded regions. First to capture both static and dynamic geometry simultaneously.

St4RTrack Team

St4RTrack — Simultaneous 4D Reconstruction and Tracking in the World

Unified world-frame 4D reconstruction and point tracking from RGB video — aligned pointmap pairs chained through sequences for long-range correspondence without 4D ground truth supervision.

Apple ML Research

Depth Pro — Sharp Monocular Metric Depth in Less Than a Second

Zero-shot metric depth estimation producing 2.25-megapixel depth maps in 0.3s — multi-scale ViT combining global context with fine details, no camera intrinsics required. Open-source with code & weights.
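
Usage is compact; the sketch below follows the helper functions exposed in the open-source repository's README (treat the exact names and signatures as an assumption and check the repo before relying on them).

```python
# Single-image metric depth with Depth Pro, per the repository's documented helpers.
import depth_pro

model, transform = depth_pro.create_model_and_transforms()
model.eval()

image, _, f_px = depth_pro.load_rgb("room.jpg")   # image, ICC profile, EXIF focal length (px)
prediction = model.infer(transform(image), f_px=f_px)

depth_m = prediction["depth"]             # dense metric depth map, in metres
focal_px = prediction["focallength_px"]   # estimated focal length when EXIF is missing
```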

Unified Multimodal Models (8 posts)

Alibaba Qwen

Qwen2-VL — To See the World More Clearly

Alibaba's next-gen vision-language model with Naive Dynamic Resolution (arbitrary aspect ratios), Multimodal RoPE for 3D positional awareness, 20+ minute video understanding, and multilingual document OCR — open-source from 2B to 72B.
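
A minimal document-QA call through Transformers looks roughly like the sketch below, assuming a recent transformers release with Qwen2-VL support and the 7B-Instruct checkpoint; the image path and prompt are placeholders.

```python
# Image question-answering sketch with Qwen2-VL (illustrative, single image).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice.png")  # any chart, document, or photo
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the total amount and the due date."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)
```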

Anthropic

Introducing Claude 3 — A New Trio of Multimodal Models

The Claude 3 family (Opus, Sonnet, Haiku) — Anthropic's first natively multimodal models combining vision and language, processing images alongside text with a 200K context window (up to 1M tokens for select customers). Opus outperforms GPT-4 on reasoning benchmarks.

Meta AI

Chameleon — Mixed-Modal Early-Fusion Foundation Models

A 34B-parameter early-fusion model processing text and images as unified discrete tokens — trained on 10 trillion multimodal tokens, achieving SOTA on mixed-modal tasks and preferred over GPT-4V.

Google

Introducing Gemini 1.5 — Next-Generation Multimodal AI

The 1M-token context window breakthrough — natively processing text, images, video, and audio within a single transformer. Dramatically enhanced long-context understanding across modalities.

Google

Gemini 3 Pro — The Frontier of Vision AI

Google's most capable multimodal model — state-of-the-art on document, spatial, screen, and video understanding, setting new records on MMMU-Pro and Video-MMMU with unprecedented vision reasoning capabilities.

Google Developers

7 Examples of Gemini's Multimodal Capabilities in Action

Practical demonstrations of Gemini's multimodal capabilities — from detailed image reasoning and 1000+ page PDF processing to video understanding and structured data extraction.

Sebastian Raschka, PhD

Understanding Multimodal LLMs

Comprehensive survey of multimodal LLM architectures — unified embedding-decoder vs. cross-modality attention approaches, reviewing ~12 recent papers including Llama 3.2, LLaVA, and BLIP-2.
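
The two connector styles the article contrasts differ mainly in where image features enter the language model; the toy sketch below shows both, with illustrative dimensions.

```python
# (a) Unified embedding-decoder: project image patches into the LLM's token space.
# (b) Cross-modality attention: let text queries attend to raw image features.
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096
image_feats = torch.randn(1, 576, d_vision)   # patch embeddings from a vision encoder
text_embeds = torch.randn(1, 32, d_model)     # token embeddings from the LLM

# (a) prepend projected image tokens; the decoder treats them like ordinary tokens
projector = nn.Linear(d_vision, d_model)
decoder_input = torch.cat([projector(image_feats), text_embeds], dim=1)   # (1, 608, 4096)

# (b) inject image features through an added cross-attention layer in each block
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, kdim=d_vision, vdim=d_vision, batch_first=True)
fused, _ = cross_attn(query=text_embeds, key=image_feats, value=image_feats)
hidden = text_embeds + fused                                              # (1, 32, 4096)
```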

Chip Huyen

Multimodality and Large Multimodal Models (LMMs)

Foundational overview of multimodal systems — from CLIP and Flamingo to BLIP-2, LLaVA, and LLaMA-Adapter V2. Covers architectural paradigms, efficient adapters, and the trajectory toward unified models.

World Models (10 posts)

NVIDIA

Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform

Open world foundation models pretrained on 9,000 trillion tokens — Cosmos Predict, Transfer, and Reason for generating physics-based synthetic data to train robots and autonomous vehicles.

Waymo

The Waymo World Model — A New Frontier for Autonomous Driving Simulation

Built upon Google DeepMind's Genie 3 — generating hyper-realistic multi-sensor simulated environments with camera and lidar data. Simulates rare edge cases, from tornadoes to vehicles on fire.

Waymo

Demonstrably Safe AI for Autonomous Driving

The Waymo Foundation Model with "Think Fast and Think Slow" architecture — combining a Sensor Fusion Encoder for rapid driving decisions with a Driving VLM powered by Gemini for complex reasoning.

Waymo

EMMA — End-to-End Multimodal Model for Autonomous Driving

Powered by Gemini — generating vehicle trajectories directly from sensor data with chain-of-thought reasoning. Improved performance through multimodal world knowledge integration.

Wayve

GAIA-1 — A Generative World Model for Autonomous Driving

Casting world modeling as unsupervised sequence modeling — video, text, and action tokens predict future driving scenarios. Scaled to roughly 9B parameters: a 6.5B world model paired with a video diffusion decoder for high-resolution generation.

Microsoft Research

Interactive World Simulator — Compositional 4D Video Generation

Decomposing text prompts into individual 3D concepts using LLMs as directors, then composing them with 2D diffusion priors for high-fidelity video generation with flexible motion and viewpoint control.

The Batch (deeplearning.ai)

Generative Video Models Revolutionize Content Creation

Andrew Ng's newsletter surveying the video generation landscape — Meta Movie Gen, Adobe Firefly Video, Runway Gen-3, Kling AI, and the convergence of T2V with world simulation capabilities.

Google DeepMind

Genie 2 — A Large-Scale Foundation World Model

Generating a vast diversity of rich 3D worlds from single prompt images — users and AI agents interact via keyboard/mouse. Emergent capabilities include physics simulation, object interactions, and complex character animation.

Google DeepMind

Genie 3 — A New Frontier for World Models

Real-time interactive world generation at 24fps and 720p — text-to-world with consistent environments lasting several minutes. Models physical properties including water, lighting, and complex dynamics.

Meta AI

V-JEPA — The Next Step Toward Yann LeCun's Vision of Advanced Machine Intelligence

Video Joint Embedding Predictive Architecture — learning to predict masked video regions in abstract representation space (not pixel-level), 1.5x–6x more efficient than generative approaches. A building block toward world models that learn through observation.
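
The objective is simple to caricature: encode the visible patches, ask a predictor for the representations of the masked patches, and compare against an exponential-moving-average "target" encoder that sees the full clip, with no pixel reconstruction anywhere. The sketch below is schematic; the encoders, masking, and especially the predictor are drastically simplified stand-ins.

```python
# JEPA-style training step in miniature (toy shapes, simplified predictor).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_tokens = 256, 128
make_encoder = lambda: nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
context_encoder, target_encoder = make_encoder(), make_encoder()
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

tokens = torch.randn(4, num_tokens, dim)       # patchified video clip (toy features)
mask = torch.rand(num_tokens) < 0.5            # patches the context encoder never sees

context = context_encoder(tokens[:, ~mask])    # visible patches only
with torch.no_grad():
    targets = target_encoder(tokens)[:, mask]  # full clip, no gradients

# Crude stand-in for the real predictor, which uses positional mask tokens:
pred = predictor(context).mean(dim=1, keepdim=True).expand_as(targets)
loss = F.l1_loss(pred, targets)                # regress representations, not pixels

# The target encoder tracks the context encoder via EMA:
for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
    p_t.data.mul_(0.996).add_(p_c.data, alpha=0.004)
```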