Announcing FLUX.2 — Frontier Visual Intelligence
The most capable model from BFL to date — featuring image editing at 4MP, complex text rendering, multi-reference consistency, and enhanced spatial reasoning across four model variants.
Authoritative technical announcements and deep dives from the laboratories advancing multimodal AI — curated for scholarly depth and engineering insight.
BFL's fastest model, offering unified generation and editing with sub-second inference and runnable on consumer hardware. The 4B-parameter variant is released under Apache 2.0.
A comprehensive research report comparing autoencoder architectures (SD-VAE, RAE, FLUX.1 VAE, FLUX.2 VAE) and their impact on latent space representation quality for image synthesis.
Documenting BFL's flow matching approach for unified image generation and editing in latent space — the architectural foundation that preceded FLUX.2.
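For readers unfamiliar with the objective, here is a minimal sketch of rectified-flow style flow-matching training in latent space; `velocity_model` is a hypothetical stand-in for the denoising transformer, and the plain linear interpolation schedule is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, latents):
    """Rectified-flow style flow-matching loss on a batch of image latents.
    `velocity_model(x_t, t)` is a placeholder for the real transformer and
    is assumed to predict the velocity field."""
    noise = torch.randn_like(latents)                        # x_1 ~ N(0, I)
    t = torch.rand(latents.shape[0], device=latents.device)  # t ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1)

    # Straight-line interpolation between clean latents (t=0) and noise (t=1).
    x_t = (1.0 - t_) * latents + t_ * noise

    # The velocity of that straight-line path is simply (noise - latents).
    target = noise - latents
    return F.mse_loss(velocity_model(x_t, t), target)
```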
The technical paper behind SD3's Multimodal Diffusion Transformer (MMDiT) architecture — outperforming DALL-E 3, Midjourney v6, and Ideogram v1 in typography and prompt adherence via human evaluation.
Three model variants (Large 8.1B, Large Turbo, Medium 2.5B) with improved MMDiT dual attention layers and QK normalization — free for commercial use, running on consumer hardware.
ByteDance's scaled-up diffusion transformer achieving top arena rankings through massive compute and architectural innovations — surpassing prior open-source models on multiple benchmarks.
A unified architecture for T2I generation and editing supporting 4K output, 30+ artistic styles, 6 image references, and 10x faster reasoning — with advanced logical understanding capabilities.
Technical walkthrough of integrating SD 3.5 into the Diffusers library — covering the improved MMDiT architecture, QK normalization, LoRA adapters, and optimization techniques.
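For orientation, a minimal Diffusers usage sketch follows; the model id, dtype, and sampler settings are assumptions, so consult the walkthrough and the model card for the exact values.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load SD 3.5 Medium (model id and settings are illustrative).
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("sd35_sample.png")
```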
A 1.3B-parameter T2I model trained in under 10 days on 32 H200 GPUs — featuring REPA, distillation, SFT, and DPO. Released under Apache 2.0 with full training transparency.
Technical walkthrough of FLUX.2's new MM-DiT architecture — single Mistral Small 3.1 encoder, SwiGLU activations, 8+48 block design, bias-free layers, and shared modulation parameters.
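As a point of reference, a minimal sketch of the SwiGLU feed-forward block mentioned above; the hidden sizes are illustrative rather than FLUX.2's actual dimensions, and the layers are written bias-free to mirror the design described in the walkthrough.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP: SiLU(x W_gate) * (x W_up), followed by a down projection."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 256)
print(SwiGLU(256, 1024)(x).shape)  # torch.Size([2, 16, 256])
```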
Efficient fine-tuning of FLUX.1-dev using QLoRA with under 10 GB VRAM on an RTX 4090 — practical guide covering transformer-only training, frozen encoders, and memory optimization techniques.
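A rough sketch of the recipe's core idea, assuming recent diffusers, peft, and bitsandbytes versions: quantize only the transformer to 4-bit NF4, keep the text encoders and VAE frozen, and attach low-rank adapters to the attention projections. Target module names and hyperparameters below are illustrative, not the guide's exact settings.

```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
from peft import LoraConfig

# Load only the transformer in 4-bit NF4; the encoders stay frozen elsewhere.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant,
    torch_dtype=torch.bfloat16,
)

# Attach low-rank adapters to the attention projections; only these adapter
# weights receive gradients, which is what keeps VRAM usage low.
lora = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
transformer.add_adapter(lora)
```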
Mistral's vision-language models: 12B (Apache 2.0) with 400M vision encoder excelling at chart/document QA, and 124B outperforming GPT-4o on MathVista — both handling variable image sizes and 128K context.
The definitive visual explainer of latent diffusion — breaking down CLIP text encoding, the UNet denoiser, latent space compression, and the iterative refinement process with beautifully crafted diagrams.
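To make the pipeline concrete, here is a toy end-to-end loop with stand-in components; the lambdas below are placeholders for the real CLIP text encoder, UNet, and VAE decoder, and the update rule is a crude substitute for a proper scheduler.

```python
import torch

# Placeholder components standing in for CLIP, the UNet, and the VAE decoder.
text_encoder = lambda prompt: torch.randn(1, 77, 768)   # prompt -> embeddings
unet = lambda z, t, ctx: torch.randn_like(z)            # predicts the noise in z
vae_decode = lambda z: torch.rand(1, 3, 512, 512)       # latents -> pixels

def generate(prompt: str, steps: int = 50) -> torch.Tensor:
    ctx = text_encoder(prompt)
    z = torch.randn(1, 4, 64, 64)   # a 512x512 image lives in a 64x64x4 latent
    for t in torch.linspace(1.0, 1.0 / steps, steps):
        eps = unet(z, t, ctx)       # text-conditioned noise prediction
        z = z - eps / steps         # crude Euler-style step; real code uses a scheduler
    return vae_decode(z)

print(generate("a watercolor fox").shape)  # torch.Size([1, 3, 512, 512])
```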
Minimal-math visual guide to diffusion models — explains what they learn, why they work, and how to extract outputs using a "glyffuser" trained on Chinese glyphs as a didactic example.
Hands-on testing of OpenAI's GPT-4o native image generation — benchmarking prompt fidelity, text rendering, public figure policy changes, API pricing analysis, and comparison with FLUX.2-klein.
Multi-scale joint diffusion enabling end-to-end 1024x1024 image and video synthesis — a nested architecture that denoises at multiple resolutions simultaneously for efficient high-fidelity generation.
OpenAI's breakthrough in caption fidelity — training a SOTA image captioner to produce detailed descriptions, then training the model on these enhanced captions for dramatically improved prompt adherence, text rendering, and fine detail.
Technical walkthrough of SD3's Multimodal Diffusion Transformer (MMDiT) — separate weights for text and image embeddings with bidirectional attention, rectified flow matching training, and the new FlowMatchEulerDiscreteScheduler.
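A compact sketch of the joint-attention idea at the heart of MMDiT, with illustrative dimensions: each modality keeps its own projection weights, but attention runs over the concatenated token sequence. QK normalization and the modulation layers are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """MMDiT-style joint attention: separate text/image weights, shared attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.txt_qkv = nn.Linear(dim, dim * 3)
        self.img_qkv = nn.Linear(dim, dim * 3)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        n_txt = txt.shape[1]

        def split(qkv):
            q, k, v = qkv.chunk(3, dim=-1)
            heads = lambda x: x.unflatten(-1, (self.heads, -1)).transpose(1, 2)
            return heads(q), heads(k), heads(v)

        tq, tk, tv = split(self.txt_qkv(txt))
        iq, ik, iv = split(self.img_qkv(img))
        # Concatenate both modalities into one sequence before attention,
        # so information flows bidirectionally between text and image tokens.
        q = torch.cat([tq, iq], dim=2)
        k = torch.cat([tk, ik], dim=2)
        v = torch.cat([tv, iv], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).flatten(-2)
        return self.txt_out(out[:, :n_txt]), self.img_out(out[:, n_txt:])

txt, img = torch.randn(1, 77, 512), torch.randn(1, 1024, 512)
t_out, i_out = JointAttention(512)(txt, img)
print(t_out.shape, i_out.shape)  # (1, 77, 512) (1, 1024, 512)
```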
Adobe's commercially-safe image model trained on licensed content — enhanced photorealistic quality, Structure & Style Reference capabilities, and superior complex prompt understanding integrated across Creative Cloud.
OpenAI's seminal video generation model announcement — introducing spacetime patches, diffusion transformers for video, and the vision of video as a path toward world simulators.
Major upgrade with accurate physics modeling, synchronized dialogue and sound, multi-shot sequences with persistent world state, and two model variants (Sora 2 and Sora 2 Pro).
DeepMind's announcement of Veo 2 for 8-second video generation and Imagen 3 for photorealistic T2I — with SynthID watermarking, physics-aware temporal synthesis, and diverse cinematic styles.
How Veo is integrated into YouTube's Dream Screen for video background generation and standalone clips — bridging research models with creator tools at billion-user scale.
A 30B-parameter transformer generating 16s of 1080p HD video with synchronized audio — SOTA on text-to-video, personalization, video editing, and audio generation benchmarks.
Factorized text-to-video generation via explicit image conditioning — two diffusion models producing 512x512 video at 4s, preferred over Make-A-Video, Imagen Video, and commercial solutions.
Joint video-image training for enhanced visual fidelity — featuring temporally dense captions for precise keyframing, Motion Brush, Advanced Camera Controls, Director Mode, and C2PA provenance.
The first open-source large video model with Mixture-of-Experts: 27B total parameters, 14B active per step, Apache 2.0 license. Two specialized experts for high-noise and low-noise stages.
Compact 8.3B-parameter model with Selective and Sliding Tile Attention (SSTA) achieving 1.87x speedup. 3D Causal VAE with 16x spatial and 4x temporal compression, plus 1080p super-resolution.
A multimodal video customization framework supporting text, image, audio, and video inputs — enabling singing avatars, virtual try-on, and consistent character generation across scenes.
Luma's Ray2 model trained with 10x more compute than previous versions — featuring Photon T2I, character references, camera motion control, and the Inductive Moment Matching (IMM) paradigm.
Luma's most advanced model with native 1080p video, dramatically improved speed and cost efficiency — featuring Modify Video for director-grade editing and natural language VFX instructions.
A comprehensive tutorial covering diffusion-based video generation from foundations — 3D U-Net vs. DiT architectures, temporal consistency, video training data, and adapting image diffusion models to video.
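One recurring pattern the tutorial covers, adapting image backbones to video, can be sketched as factorized spatial and temporal attention; the shapes and layer choices below are illustrative.

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Attend over space within each frame, then over time per spatial location."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, n, d = x.shape
        xs = x.reshape(b * f, n, d)                        # spatial attention per frame
        xs = self.spatial(xs, xs, xs)[0].reshape(b, f, n, d)
        xt = xs.permute(0, 2, 1, 3).reshape(b * n, f, d)   # temporal attention per location
        xt = self.temporal(xt, xt, xt)[0].reshape(b, n, f, d)
        return xt.permute(0, 2, 1, 3)

video_tokens = torch.randn(2, 8, 256, 512)                 # 8 frames, 16x16 latent grid
print(FactorizedSTAttention(512)(video_tokens).shape)      # (2, 8, 256, 512)
```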
Pikaframes for 5–25 second sequences with keyframe control, advanced creative effects (Pikadditions, Pikaswaps, Pikatwists), and improved natural motion at 1080p resolution.
The first open foundation model for generative video — 3-stage latent video diffusion training, 576x1024 at 14/25 frames, preferred over closed-source models in user studies. Code + weights on Hugging Face.
A comprehensive visual walkthrough of Sora's architecture — spacetime patch tokenization, latent space compression, DiT denoising process, and emergent 3D consistency, explained from first principles.
The seminal Sora technical report — introducing spacetime patches, diffusion transformers for video, and the thesis that scaling video generation is a path toward general-purpose physical world simulators.
Deep analysis of Sora's architecture — visual patch tokenization, latent space compression, variable resolution/duration training, diffusion transformer scaling, and emergent 3D consistency.
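The spacetime-patch idea these Sora analyses describe can be sketched in a few lines; the patch sizes and latent shapes here are illustrative, not the model's actual values.

```python
import torch

def spacetime_patches(latents: torch.Tensor, pt: int = 2, ph: int = 2, pw: int = 2):
    """Chop a (C, T, H, W) video latent into non-overlapping pt x ph x pw blocks
    and flatten each block into one token."""
    c, t, h, w = latents.shape
    x = latents.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)           # (nT, nH, nW, C, pt, ph, pw)
    return x.reshape(-1, c * pt * ph * pw)       # one row per spacetime patch

video_latent = torch.randn(16, 8, 32, 32)        # C=16, 8 latent frames, 32x32
print(spacetime_patches(video_latent).shape)     # torch.Size([1024, 128])
```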
Apple's 8.7B-parameter video model surpassing CogVideoX-5B, Pika, and Kling — 83.1 on VBench T2V, supporting text-to-video, image-to-video, video prediction, and multi-view generation.
MLLM performing any-to-any generation across text, vision, and audio — enabling in-context learning and few-shot capabilities for editing, composition, and subject-driven generation.
The SIGGRAPH 2023 Best Paper that launched the 3DGS revolution — anisotropic Gaussians with interleaved optimization, achieving 100+ fps at 1080p while matching Mip-NeRF 360 quality in 30 minutes.
INRIA's accessible overview of how 3D Gaussian Splatting represents a fundamental shift from volumetric to point-based rendering — covering the journey from NeRF to real-time neural scene representations.
SIGGRAPH Asia 2024 — replacing rasterization with GPU ray tracing hardware for 3DGS. Enables secondary lighting effects (shadows, reflections) and handles distorted cameras common in robotics.
An in-depth technical tutorial explaining Gaussian Splatting from first principles — covering 3D Gaussian representation, covariance matrices, depth compositing, and the full rendering pipeline.
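Two of the tutorial's building blocks, the covariance parameterization and front-to-back compositing, can be sketched as follows; this is a toy single-Gaussian, single-pixel version, not the CUDA rasterizer.

```python
import torch
import torch.nn.functional as F

def covariance_3d(scale: torch.Tensor, quat: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T from a scale vector and a unit quaternion (w, x, y, z)."""
    w, x, y, z = F.normalize(quat, dim=0)
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    S = torch.diag(scale)
    return R @ S @ S.T @ R.T

def composite(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back compositing of depth-sorted Gaussians on one pixel:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1 - alphas[:-1]]), dim=0)
    return (colors * (alphas * transmittance).unsqueeze(-1)).sum(dim=0)

sigma = covariance_3d(torch.tensor([0.1, 0.2, 0.05]),
                      torch.tensor([1.0, 0.0, 0.0, 0.0]))
pixel = composite(torch.rand(4, 3), torch.tensor([0.8, 0.5, 0.3, 0.9]))
print(sigma.shape, pixel.shape)  # torch.Size([3, 3]) torch.Size([3])
```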
CVPR 2024 — eliminating camera calibration for 3D reconstruction by directly regressing pointmaps from arbitrary image collections. End-to-end pipeline unifying monocular and stereo reconstruction without camera intrinsics.
Hugging Face's accessible introduction to 3DGS — covering the fundamentals of Gaussian primitives, differentiable rasterization, optimization pipeline, and integration with the ML ecosystem.
Comprehensive hands-on tutorial — paper explanation, SfM initialization, adaptive density control, differentiable rasterization, plus step-by-step training on custom datasets with Nerfstudio's gsplat library.
Annual industry review — Hollywood and NVIDIA adopting 3DGS at scale, Apple and Tesla demonstrations, city-scale reconstruction, Brush 0.2, and the convergence of splatting with game engines.
Tencent's open-source 3D generation pipeline: a flow-based DiT for geometry with importance sampling, plus a multi-view consistent texture painter — outperforming closed-source models in geometry detail, alignment, and texture quality.
NeurIPS 2024 Spotlight — extending 3DGS to large-scale dynamic urban scenes via scene graphs with neural appearance fields. 3+ dB PSNR improvement and 100x faster rendering over prior methods.
Reconstructing casual monocular videos in ~2 minutes — grid pruning reduces Gaussians by 92% with 30x speedup. No calibrated cameras required for in-the-wild 4D reconstruction.
Combining event cameras with RGB frames for 4D reconstruction of fast-moving scenes — a time-conditioned NeRF using Fast Event Accumulation with Decay, working from as few as two training views.
Extending DUSt3R to dynamic scenes with feed-forward 4D pointmap regression — optical flow alignment and dynamic masks handle occluded regions. First to capture both static and dynamic geometry simultaneously.
Unified world-frame 4D reconstruction and point tracking from RGB video — aligned pointmap pairs chained through sequences for long-range correspondence without 4D ground truth supervision.
Zero-shot metric depth estimation producing 2.25-megapixel depth maps in 0.3s — multi-scale ViT combining global context with fine details, no camera intrinsics required. Open-source with code & weights.
Alibaba's next-gen vision-language model with Naive Dynamic Resolution (arbitrary aspect ratios), Multimodal RoPE for 3D positional awareness, 20+ minute video understanding, and multilingual document OCR — open-source from 2B to 72B.
The Claude 3 family (Opus, Sonnet, Haiku) — Anthropic's first natively multimodal models combining vision and language, processing images alongside text with a 200K context window (inputs beyond 1M tokens for select customers). Opus outperforms GPT-4 on reasoning benchmarks.
A 34B-parameter early-fusion model processing text and images as unified discrete tokens — trained on 10 trillion multimodal tokens, achieving SOTA on mixed-modal tasks and preferred over GPT-4V.
The 1M-token context window breakthrough — natively processing text, images, video, and audio within a single transformer. Dramatically enhanced long-context understanding across modalities.
Google's most capable multimodal model — SOTA on document, spatial, screen, and video understanding. New benchmarks on MMMU Pro and Video MMMU with unprecedented vision reasoning capabilities.
Practical demonstrations of Gemini's multimodal capabilities — from detailed image reasoning and 1000+ page PDF processing to video understanding and structured data extraction.
Comprehensive survey of multimodal LLM architectures — unified embedding-decoder vs. cross-modality attention approaches, reviewing ~12 recent papers including Llama 3.2, LLaVA, and BLIP-2.
Foundational overview of multimodal systems — from CLIP and Flamingo to BLIP-2, LLaVA, and LLaMA-Adapter V2. Covers architectural paradigms, efficient adapters, and the trajectory toward unified models.
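The unified embedding-decoder approach these overviews describe (LLaVA-style) reduces to a small connector module; the dimensions and the two-layer MLP below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Project frozen vision-encoder features into the LLM's token embedding
    space and prepend them to the text embeddings."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor):
        # image_feats: (B, n_patches, vision_dim) from a frozen ViT
        # text_embeds: (B, n_tokens, llm_dim) from the LLM's embedding table
        image_tokens = self.proj(image_feats)
        return torch.cat([image_tokens, text_embeds], dim=1)  # fed to the decoder

fused = VisionLanguageConnector()(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```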
Open world foundation models pretrained on 9,000 trillion tokens — Cosmos Predict, Transfer, and Reason for generating physics-based synthetic data to train robots and autonomous vehicles.
Built upon Google DeepMind's Genie 3 — generating hyper-realistic multi-sensor simulated environments with camera and lidar data. Simulates rare edge-cases from tornadoes to vehicles on fire.
The Waymo Foundation Model with "Think Fast and Think Slow" architecture — combining a Sensor Fusion Encoder for rapid driving decisions with a Driving VLM powered by Gemini for complex reasoning.
Powered by Gemini — generating vehicle trajectories directly from sensor data with chain-of-thought reasoning. Improved performance through multimodal world knowledge integration.
Casting world modeling as unsupervised sequence modeling — video, text, and action tokens predict future driving scenarios. Scaled to 9B parameters with a 6.5B world model for high-resolution generation.
Decomposing text prompts into individual 3D concepts using LLMs as directors, then composing them with 2D diffusion priors for high-fidelity video generation with flexible motion and viewpoint control.
Andrew Ng's newsletter surveying the video generation landscape — Meta Movie Gen, Adobe Firefly Video, Runway Gen-3, Kling AI, and the convergence of T2V with world simulation capabilities.
Generating vast diversity of rich 3D worlds from single prompt images — users and AI agents interact via keyboard/mouse. Emergent capabilities include physics simulation, object interactions, and complex character animation.
Real-time interactive world generation at 24fps and 720p — text-to-world with consistent environments lasting several minutes. Models physical properties including water, lighting, and complex dynamics.
Video Joint Embedding Predictive Architecture — learning to predict masked video regions in abstract representation space (not pixel-level), 1.5x–6x more efficient than generative approaches. A building block toward world models that learn through observation.
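A minimal sketch of the JEPA-style training step described above, with toy MLPs standing in for the real video transformers; the loss function and masking ratio are assumptions for illustration, and the target encoder is treated as a gradient-free EMA copy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, video_tokens, mask):
    """Predict the *representations* of masked video patches, not their pixels.
    `target_encoder` is an EMA copy of `context_encoder` and gets no gradients."""
    with torch.no_grad():
        targets = target_encoder(video_tokens)          # (B, N, D) for the full clip
    visible = video_tokens * (~mask).unsqueeze(-1)      # zero out masked patches
    context = context_encoder(visible)
    preds = predictor(context)                          # predictions at every position
    # Loss only on masked positions, computed in representation space.
    return F.smooth_l1_loss(preds[mask], targets[mask])

dim = 256
enc = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
tgt = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
pred = nn.Linear(dim, dim)
tokens = torch.randn(2, 128, dim)                       # 2 clips, 128 patch tokens each
mask = torch.rand(2, 128) < 0.5                         # roughly half the patches hidden
print(jepa_step(enc, tgt, pred, tokens, mask).item())
```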