World models

A world model is a learned predictor of what will happen next given the current observation and a candidate action. Replace the “what will happen” with anything: the next image frame, the next robot state, the reward thirty steps from now. Whatever the prediction target, a world model is the bridge between “imagine” and “commit.”

Why robotics is converging on world models now

For two decades the canonical robotics planner was a sampling-based search (RRT, PRM) over a kinematic state space, sometimes wrapped in MPC over an analytically modelled environment. That works beautifully in structured industrial settings and breaks in three places:

  1. Unmodelled contact. Closed-form dynamics ignore friction transients, deformation, and stick-slip. A towel doesn’t fold the way a rigid-body simulator predicts.
  2. Semantic scenes. “Pick up the red mug” demands object recognition, grasping affordance, and pose estimation — three subsystems your motion planner cannot author.
  3. Multi-step task structure. “Unzip the school bag, then put the books in” is two policies and a re-grasp. Hard-coding the FSM doesn’t scale.

A generative model trained on enough robot, human, and synthetic video sidesteps all three. Friction shows up implicitly in the training distribution. Semantic prompts (“the red mug”) are conditioning. Multi-step tasks are just longer rollouts.

The τ₀-WM architecture

Midcore integrates τ₀-WM (“tau-zero world model”), published in May 2026 by Shanghai Innovation Institute and AGIBOT Finch under Apache 2.0. It consists of two halves trained on a single corpus:

Video Action Model (VAM) — 5.5 B parameters total

  • Video branch (5 B): a DiT-style transformer cloned from Wan2.2-TI2V-5B. Predicts a future latent trajectory zₜ₊₁…zₜ₊ₕ conditioned on the current observation, the language instruction, and the robot state.
  • Action branch (0.5 B): a second DiT-style decoder that emits the action chunk aₜ₊₁…aₜ₊ₖ for the same horizon.
  • Coupling: feature-level cross-attention at matched transformer stages, so the action branch sees the same intermediate visual features that the video branch is using to imagine the future.
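The coupling described above can be pictured as ordinary cross-attention in which action tokens are the queries and the video branch’s intermediate features are the keys and values. The following is a minimal sketch under simplifying assumptions: a single head, no learned projections, and toy two-dimensional features. The name `cross_attend` is hypothetical and does not come from the τ₀-WM code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(action_tokens, video_feats):
    """Single-head cross-attention without learned projections (illustrative).

    action_tokens: query vectors from the action branch.
    video_feats:   key/value vectors from the video branch.
    Returns one vector per action token: a weighted average of the video
    features, weighted by scaled dot-product similarity.
    """
    dim = len(video_feats[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in action_tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in video_feats]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, video_feats)) for d in range(dim)]
        outputs.append(out)
    return outputs

# Toy inputs: three video feature vectors, two action tokens.
video_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
action_tokens = [[10.0, 0.0], [0.0, 0.0]]
attended = cross_attend(action_tokens, video_feats)
```

The all-zero query attends uniformly and receives the mean of the video features, while the first query concentrates on the features whose first component is large. In a real model this would run at several matched transformer stages with learned query, key, and value projections.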

Action-Conditioned Video Simulator (ACVS)

  • Reuses the same Wan VAE + video transformer backbone but removes the action-generating branch.
  • Conditions on a candidate action chunk (treated as a clean input, not generated) plus the visual context.
  • Outputs an imagined latent rollout and a per-frame reward trajectory r̂ₜ₊₁…r̂ₜ₊ₕ.
  • Functions as a learned simulator: “If we ran this chunk, here’s what the world would look like and how it would score.”

VAM ships today; ACVS is gated

At launch (May 2026) Shanghai Innovation Institute published the VAM weights only; ACVS weights are flagged “coming soon.” Midcore’s Simulation, Twin, and Safety sections render a pre-release banner until ACVS lights up. The moment the upstream weights land you can flip an environment variable on your deployment and the same panels activate end-to-end — the surfaces are already wired against the ACVS contract.

How a world model trains

τ₀-WM uses flow matching, the successor to denoising diffusion that’s now standard in large generative work. The model learns a velocity field that transports samples from a noise distribution to the data distribution along straight-line paths.
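To make the straight-line picture concrete, here is a minimal sketch of how one flow-matching training pair can be built. It assumes the common convention that u = 0 is pure noise and u = 1 is pure data; the released recipe may parameterise the path differently, and the function names are hypothetical.

```python
import random

def flow_matching_pair(data, noise, u):
    """Return the interpolated sample and its target velocity.

    Assumed convention: u = 0 is pure noise, u = 1 is pure data, and the
    path between them is a straight line, so the velocity is constant.
    """
    x_u = [(1.0 - u) * n + u * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_u, velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Squared L2 distance between predicted and target velocity."""
    return sum((p - t) ** 2 for p, t in zip(predicted_velocity, target_velocity))

rng = random.Random(0)
data = [0.5, -1.0, 2.0]                      # stand-in for a latent or action vector
noise = [rng.gauss(0.0, 1.0) for _ in data]  # sample from the noise distribution
u = rng.random()                             # noise level drawn from [0, 1)

x_u, v_target = flow_matching_pair(data, noise, u)

# Following the true velocity from x_u for the remaining time (1 - u)
# lands on the data sample, because the path is a straight line.
recovered = [x + (1.0 - u) * v for x, v in zip(x_u, v_target)]
```

A network trained this way is asked to predict `v_target` from `x_u`, `u`, and the conditioning inputs; sampling then integrates the predicted velocity from noise towards data.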

The video loss and the action loss are flow-matching residuals weighted equally:

  L_VAM = E[ λ_z · ‖f_θ^z(z̃, u_z, c_t, p) − v_z‖²
            + λ_a · ‖f_θ^a(ã, u_a, s_t, h) − v_a‖² ]

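...where z̃ and ã are the noised video latents and actions, v_z and v_a are the corresponding target velocities, c_t is the current context (visual + state + language), p is the prompt, s_t is the robot state, h is the intermediate video feature the action branch attends to, and u_z and u_a are the noise levels for each branch. Both weights, λ_z and λ_a, are 1 in the released training recipe.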

The clever part is modality-specific supervision masks: each clip in the training corpus contributes only the losses it can support. A human egocentric video (no action vector) contributes only the video loss; a teleop episode contributes both; a failure recovery contributes both plus a reward channel.
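A supervision mask of this kind can be sketched as a loss that sums only the terms a clip has labels for. The sketch below is an assumption about the mechanics, not the released training code: the `ClipLosses` type, the `masked_total` function, and the reward weight `lambda_r` are hypothetical (the source gives weights only for the video and action terms).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClipLosses:
    """Per-clip loss terms; None means the clip has no labels for that term."""
    video: Optional[float] = None
    action: Optional[float] = None
    reward: Optional[float] = None

def masked_total(clip, lambda_z=1.0, lambda_a=1.0, lambda_r=1.0):
    """Sum only the loss terms this clip can supervise.

    lambda_r is a hypothetical weight for the reward term.
    """
    total = 0.0
    if clip.video is not None:
        total += lambda_z * clip.video
    if clip.action is not None:
        total += lambda_a * clip.action
    if clip.reward is not None:
        total += lambda_r * clip.reward
    return total

# Illustrative loss values only, not measurements.
human_ego = ClipLosses(video=0.5)                          # no action vector
teleop = ClipLosses(video=0.5, action=0.25)                # video + action
recovery = ClipLosses(video=0.5, action=0.25, reward=0.125)  # all three

totals = [masked_total(c) for c in (human_ego, teleop, recovery)]
```

Because a missing term contributes nothing rather than a zero-valued target, unlabeled clips never push the action or reward heads towards a spurious default.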

The training corpus

τ₀-WM’s 27,300 hours of pre-training data split three ways:

Source | Hours | What it teaches the model
Real-robot teleop | 17,800 | Action-grounded manipulation across AGIBOT-G01, ARX, dual-arm Franka platforms.
UMI-style handheld gripper | 6,500 | Universal Manipulation Interface: a person walks around with a gripper and captures the demonstration without needing a robot present.
Egocentric human video | 3,000 | Egodex, Egoverse, Xperience-10M: head-mounted-camera footage of humans doing tasks. Provides visual diversity without any robot data.
Rollout + failure trajectories | mixed in | Provides the reward channel ACVS needs to score futures.

Total compute: 64 H100 GPUs for 42 hours of pre-training (2,688 H100-hours), then 16 H100 GPUs for 26 hours of post-training (416 H100-hours). That’s roughly 3,100 H100-hours per checkpoint: too much for most teams to repeat from scratch, though a domain fine-tune that starts from the released checkpoint is easily affordable.

The action and state contract

τ₀-WM commits to a specific input/output shape that any compatible robot has to honour. Midcore’s Designer ships a “Dual-arm Franka FR3 (τ₀-WM ready)” template pre-configured to match.

Channel | Direction | Shape | Frame / notes
State | observation in | 14 ch = [left xyz + left quat (xyzw)] + [right xyz + right quat (xyzw)] | Each EE pose in its own arm-base frame.
Gripper state | observation in | 2 ch ∈ [0, 120] | 0 = open, 120 = closed.
Action | output | {T, 16} chunk = [left EE pose (7) + left gripper (1) + right EE pose (7) + right gripper (1)] × T | Same arm-base frames. Gripper output is normalised to [0, 1].
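The contract above can be checked mechanically before anything reaches the model. The following sketch validates the observation shapes and splits one 16-channel action row into per-arm commands. It makes two assumptions the table does not confirm: that the 7-channel EE pose in the action is ordered xyz then quaternion (xyzw), like the state, and that the raw gripper reading maps linearly onto [0, 1] with 1 = closed. All names are hypothetical, not Midcore or τ₀-WM APIs.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

STATE_CHANNELS = 14    # two arms x (xyz + quaternion xyzw)
ACTION_CHANNELS = 16   # two arms x (7-ch EE pose + 1 gripper)
GRIPPER_MAX = 120.0    # observed gripper range is [0, 120]

@dataclass
class ArmCommand:
    xyz: List[float]
    quat_xyzw: List[float]   # assumed ordering, matching the state channels
    gripper: float           # normalised to [0, 1]

def check_state(state: Sequence[float], grippers: Sequence[float]) -> None:
    """Raise ValueError if the observation does not match the contract."""
    if len(state) != STATE_CHANNELS:
        raise ValueError(f"expected {STATE_CHANNELS} state channels, got {len(state)}")
    if len(grippers) != 2:
        raise ValueError(f"expected 2 gripper channels, got {len(grippers)}")
    for g in grippers:
        if not 0.0 <= g <= GRIPPER_MAX:
            raise ValueError(f"gripper reading {g} outside [0, {GRIPPER_MAX}]")

def normalise_gripper(raw: float) -> float:
    """Map a raw reading in [0, 120] to [0, 1] (assumed linear, 1 = closed)."""
    return raw / GRIPPER_MAX

def split_action_step(step: Sequence[float]) -> Tuple[ArmCommand, ArmCommand]:
    """Split one 16-channel action row into left and right arm commands."""
    if len(step) != ACTION_CHANNELS:
        raise ValueError(f"expected {ACTION_CHANNELS} action channels, got {len(step)}")
    left = ArmCommand(list(step[0:3]), list(step[3:7]), step[7])
    right = ArmCommand(list(step[8:11]), list(step[11:15]), step[15])
    return left, right

# One row of a {T, 16} chunk with identity orientations.
row = [0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0, 0.5,
       0.4, 0.5, 0.6, 0.0, 0.0, 0.0, 1.0, 1.0]
left, right = split_action_step(row)
```

Applying `split_action_step` to each of the T rows yields the full chunk; rejecting malformed inputs at this boundary is cheaper than debugging a policy that silently received the wrong channel order.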

Internal vs wire representation

Internally τ₀-WM trains on a 20-channel relative pose representation (xyz + 6D rotation per arm, which accounts for 18 channels; the remaining 2 are presumably the gripper channels), then converts to absolute pose + quaternion at the OpenPI wire boundary because quaternions are easier to debug. Midcore’s pose-math utilities handle both directions automatically: you author in quaternions, the model trains on 6D, and the conversion between them stays consistent.
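For readers unfamiliar with the 6D rotation form, the sketch below converts a quaternion (xyzw) to a rotation matrix, takes the first two matrix columns as the 6D vector, and recovers the matrix by Gram-Schmidt orthonormalisation. This is the commonly used construction; whether τ₀-WM uses columns or rows, and how it handles the relative-versus-absolute conversion, is not stated in the source, so treat those details as assumptions. The function names are hypothetical and are not Midcore’s pose-math API.

```python
import math

def quat_xyzw_to_matrix(q):
    """Convert a quaternion (x, y, z, w) to a 3x3 rotation matrix (row lists)."""
    x, y, z, w = q
    n = math.sqrt(x * x + y * y + z * z + w * w)
    x, y, z, w = x / n, y / n, z / n, w / n   # normalise defensively
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def matrix_to_rot6d(R):
    """6D rotation: the first two columns of R, concatenated (assumed layout)."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]

def _normalise(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def rot6d_to_matrix(r6):
    """Rebuild a rotation matrix from a 6D vector via Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalise(a1)
    dot = sum(p * q for p, q in zip(b1, a2))
    b2 = _normalise([q - dot * p for p, q in zip(b1, a2)])
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],      # cross product b1 x b2
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

# 90-degree rotation about z, written as an xyzw quaternion.
quat = [0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4)]
R = quat_xyzw_to_matrix(quat)
r6 = matrix_to_rot6d(R)
R_back = rot6d_to_matrix(r6)
```

The 6D form is continuous and has no sign ambiguity, which is the usual reason for training on it, while the quaternion remains the more compact and readable form at the wire boundary.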

Why this architecture wins

Three design choices give τ₀-WM its edge over a flat VLA:

  1. Joint future + action learning. Asking the model to predict images and actions simultaneously means the action branch learns from features that the video prediction loss also shapes. That makes it harder for the action head to hallucinate, because the video branch acts as a regulariser.
  2. Modality-mixed pre-training. Human egocentric video brings semantic and physical breadth (humans see more places and do more things than any robot fleet); UMI brings action-grounded demonstrations without requiring scarce teleop time; real robot teleop calibrates the embodiment-specific bits.
  3. A learned reward / simulator channel. ACVS is what lets the deployed policy do test-time computation: imagine N candidate futures, score them, pick the best, refine. Without that channel a policy is one-shot.
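The test-time loop in the third point (imagine N candidates, score them, pick the best) can be sketched as a best-of-N selection. The simulator here is a toy stand-in, not ACVS, and summing the per-frame rewards into one score is an assumption; the refinement step is omitted. All names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Chunk = List[List[float]]   # T rows of action channels

def pick_best_chunk(
    candidates: List[Chunk],
    simulate_rewards: Callable[[Chunk], List[float]],
) -> Tuple[int, float]:
    """Score each candidate by its summed predicted reward; return the winner.

    simulate_rewards plays the role of the learned simulator: it maps an
    action chunk to a per-frame reward trajectory.
    """
    scores = [sum(simulate_rewards(chunk)) for chunk in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

def toy_simulator(chunk: Chunk) -> List[float]:
    """Illustrative stand-in: reward is highest when every channel is 0.5."""
    return [-sum((a - 0.5) ** 2 for a in step) for step in chunk]

rng = random.Random(0)
# Eight random candidate chunks of 4 steps x 16 channels.
candidates = [[[rng.random() for _ in range(16)] for _ in range(4)] for _ in range(8)]
# Append the known optimum for the toy reward so the result is easy to verify.
candidates.append([[0.5] * 16 for _ in range(4)])

best_index, best_score = pick_best_chunk(candidates, toy_simulator)
```

With a real learned simulator the cost of this loop scales with N rollouts per decision, which is the price of moving from a one-shot policy to one that can compare futures before committing.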

What world models still can’t do

  • Tactile-rich manipulation. Vision alone is too sparse for insertion, fastening, and deformable objects. The τ₀-WM authors flag this explicitly as future work.
  • Calibrated uncertainty. The RCS confidence score is empirical — it works, but it’s not a calibrated probability. A model can confidently fail.
  • Long-horizon planning. Action chunks are ~16 steps. Anything longer is stitched at the policy level (next chunk conditions on this chunk’s execution).
  • Out-of-distribution embodiments. A 5-finger hand isn’t in the pre-training corpus. You can’t fine-tune your way to it.
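The policy-level stitching mentioned under long-horizon planning amounts to a loop that requests a fresh chunk after each one has been executed. The sketch below uses a one-dimensional toy state and a toy policy purely to show the control flow; the names, the dynamics, and the fixed 16-step horizon are illustrative assumptions.

```python
def toy_policy(observation, horizon=16):
    """Hypothetical policy: each step closes 10% of the remaining gap to 1.0."""
    chunk = []
    x = observation
    for _ in range(horizon):
        step = 0.1 * (1.0 - x)
        chunk.append(step)
        x += step
    return chunk

def run_long_horizon(policy, start, num_chunks, horizon=16):
    """Execute chunk after chunk, re-observing between chunks."""
    x = start
    steps_executed = 0
    for _ in range(num_chunks):
        chunk = policy(x, horizon)   # next chunk conditions on where the last one ended
        for action in chunk:
            x += action              # toy dynamics: the state integrates the action
            steps_executed += 1
    return x, steps_executed

final_state, total_steps = run_long_horizon(toy_policy, start=0.0, num_chunks=4)
```

Four chunks cover 64 steps here, but nothing in the loop plans across chunk boundaries: each chunk only sees the state its predecessor left behind, which is why the horizon limit is listed as an open problem.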

Next: how the model actually emits actions

On to manipulation policies — what an action chunk is, what RCS measures, and what LAR does when RCS reports low confidence.