Glossary
Every term used in the robotics knowledge section, defined. Sorted alphabetically. Cross-linked from every other page.
- Action chunk
- A short sequence of future actions, shaped (T, action_dim), predicted in one inference call. Lets the policy and the low-level controller run independently between calls. τ₀-WM emits 16-channel chunks (left/right EE + grippers) over a typical horizon of 16 timesteps.
- ACVS (Action-Conditioned Video Simulator)
- The second model in the τ₀-WM family. Reuses the video backbone, removes the action branch, and conditions on a candidate action chunk to predict an imagined latent rollout plus a per-frame reward trajectory. Acts as a learned simulator at inference time.
- Arm-base frame
- The coordinate frame whose origin is the first non-moving link of an arm. τ₀-WM and most VLA models express end-effector poses in arm-base frames, not the world frame.
- DOF (Degrees of Freedom)
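- The number of independent dimensions in which a mechanism can move. A standard 6-DOF arm can reach any position and orientation inside its workspace, subject to joint limits and singularities; a 7-DOF cobot has one redundant DOF that lets it reposition the elbow without moving the end-effector; a humanoid usually has 20–30 DOF.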
- End-effector (EE)
- The business end of a manipulator — usually a gripper or tool. The EE pose (xyz + orientation) is what the policy commits to; joint angles are downstream details handled by inverse kinematics.
- Episode
- One continuous attempt at a task: from start to either success or failure. The unit of dataset capture.
- Flow matching
- A generative-model training technique. The model learns a velocity field that transports samples from a noise distribution to the data distribution along straight-line paths. Successor to denoising diffusion; what τ₀-WM uses for both its video and action heads.
- Forward / inverse kinematics (FK / IK)
- FK maps joint angles to an EE pose and always has exactly one answer. IK maps an EE pose back to joint angles; it can have no solution, several, or infinitely many, and is harder to compute.
- γ (gating threshold)
- The RCS score above which a proposed action chunk is considered commit-ready. Default 0.6. Below γ but above the hard floor, the operator must explicitly force-confirm.
- Hard floor
- The RCS score below which a proposed action chunk is blocked entirely. Default 0.2. The proposal is still recorded on the audit log for review.
- LAR (Low-quality Action Rectification)
- τ₀-WM’s one-shot correction loop. When RCS lands in the gated regime, ACVS scores all N candidates’ imagined futures, the highest-peak-reward future becomes a conditioning input, and the policy is re-queried for a corrected chunk.
- LeRobot
- HuggingFace’s open-source robotics dataset format and Python toolkit. The de-facto standard for sharing manipulation demonstrations. Midcore captures and exports in LeRobot 0.5.1 format.
- OpenPI
- The open WebSocket policy protocol from Physical Intelligence. The wire contract every modern VLA implements. policy.infer(obs) returns an action chunk plus auxiliary signals (RCS, LAR applied, latency).
- Policy
- The function mapping observation to action. In modern manipulation that’s a learned neural network — usually a VLA — running behind an OpenPI server.
- Quaternion
- A four-number representation of a rotation. No gimbal lock; SLERP-interpolatable. Midcore uses the xyzw convention.
- RCS (Re-denoising Consistency Score)
- τ₀-WM’s built-in confidence signal. Re-noises the proposed action chunk, denoises it again, and scores the consistency. High score = the candidate lies on the policy’s learned manifold; low score = it doesn’t.
- RobotDefinition
- Midcore’s in-memory representation of a robot — morphology (links + joints), sensors, actuators, controller hints. The unit of editing in the Designer.
- 6D continuous rotation
- Zhou et al.’s rotation encoding for neural networks (CVPR 2019). The first two columns of the rotation matrix, flattened. Continuous everywhere; the standard choice for regression heads. τ₀-WM trains on this internally and converts to quaternion at the wire boundary.
- τ₀-WM
- The unified video-action world model published by Shanghai Innovation Institute and AGIBOT Finch in May 2026 under Apache 2.0. 5.5 B parameters (5 B video DiT + 0.5 B action DiT); pre-trained on 27,300 hours of mixed manipulation video. The default policy provider Midcore integrates.
- UMI (Universal Manipulation Interface)
- A capture rig — handheld parallel gripper plus camera plus IMU — that a person uses to demonstrate manipulation tasks outside of any robot context. The recorded demonstrations transfer to robot embodiments because the action signal is gripper-centric. 6,500 hours of UMI data went into τ₀-WM’s pre-training.
- URDF
- Unified Robot Description Format. XML-based robot description that’s the lingua franca of the ROS ecosystem. Designer imports URDF files and can export a robot definition back to URDF.
- VAM (Video Action Model)
- The first half of τ₀-WM: the dual-branch DiT transformer that predicts both a future visual latent trajectory and an executable action chunk.
- VLA (Vision-Language-Action)
- Class of models that take RGB pixels and a natural-language prompt as input and emit robot actions as output. RT-1, RT-2, π0, OpenVLA, and τ₀-WM are all VLAs, with varying degrees of sophistication.
- Wan2.2-TI2V-5B
- The 5 B-parameter text-and-image-to-video diffusion model that τ₀-WM’s video branch is cloned from. Apache 2.0, published by Wan-Video.
- World model
- A learned predictor of future observations, states, or rewards conditional on a candidate action. Replaces the analytically modelled simulators of classical robotics with a data-driven equivalent that handles contact, semantics, and multi-step structure.
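The γ and Hard floor entries together describe a three-way gate on the RCS score. A minimal sketch of that gate follows, using the default thresholds from the glossary (0.6 and 0.2). The names `Gate` and `gate_chunk` are hypothetical and not part of any Midcore or τ₀-WM API, and the handling of scores exactly equal to a threshold is an assumption, since the entries only say "above" and "below".

```python
from enum import Enum


class Gate(Enum):
    """Hypothetical labels for the three gating outcomes."""
    COMMIT_READY = "commit-ready"
    FORCE_CONFIRM = "force-confirm"
    BLOCKED = "blocked"


def gate_chunk(rcs: float, gamma: float = 0.6, hard_floor: float = 0.2) -> Gate:
    """Classify a proposed action chunk by its RCS score.

    Assumption: a score exactly equal to gamma or to hard_floor falls
    into the force-confirm band, because the glossary defines the other
    two outcomes with strict "above" and "below".
    """
    if hard_floor > gamma:
        raise ValueError("hard_floor must not exceed gamma")
    if rcs > gamma:
        return Gate.COMMIT_READY
    if rcs < hard_floor:
        return Gate.BLOCKED
    return Gate.FORCE_CONFIRM


if __name__ == "__main__":
    for score in (0.8, 0.4, 0.1):
        print(score, gate_chunk(score).value)
```

A blocked proposal would still be written to the audit log, as the Hard floor entry notes; logging is left out of the sketch.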
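The Forward / inverse kinematics entry can be made concrete with the textbook two-link planar arm, where FK has a single answer and IK typically has two (elbow-up and elbow-down) or none. This is an illustrative sketch only: the function names and the unit link lengths are assumptions, and real arms use general-purpose solvers rather than this closed form.

```python
import math


def fk_2link(q1, q2, l1=1.0, l2=1.0):
    """Forward kinematics: joint angles (radians) to the EE position (x, y)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def ik_2link(x, y, l1=1.0, l2=1.0):
    """Inverse kinematics: EE position to a list of (q1, q2) solutions.

    Returns an empty list when the target is out of reach, one solution
    on the boundary of the reachable region, and two otherwise.
    """
    # Law of cosines gives the cosine of the elbow angle.
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0 + 1e-9:
        return []
    c2 = max(-1.0, min(1.0, c2))  # absorb rounding error
    s = math.sqrt(1.0 - c2 * c2)
    # At the boundary of reach the two elbow branches coincide.
    branches = (s,) if s == 0.0 else (s, -s)
    solutions = []
    for s2 in branches:
        q2 = math.atan2(s2, c2)
        q1 = math.atan2(y, x) - math.atan2(l2 * s2, l1 + l2 * c2)
        solutions.append((q1, q2))
    return solutions


if __name__ == "__main__":
    for q1, q2 in ik_2link(1.0, 1.0):
        print((q1, q2), "->", fk_2link(q1, q2))
```

Feeding each IK solution back through FK is a cheap sanity check that both joint configurations land on the same target.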
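The 6D continuous rotation and Quaternion entries describe a conversion at the wire boundary. The sketch below shows one way it could look: Gram-Schmidt orthonormalization recovers a rotation matrix from the six numbers, as in Zhou et al., and a standard branch-on-trace conversion produces an xyzw quaternion. The function names are hypothetical, and the flattening order (first three numbers are the first matrix column, last three the second) is an assumption; τ₀-WM's actual layout may differ.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("cannot normalize a near-zero vector")
    return [c / n for c in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rot6d_to_matrix(r6):
    """Gram-Schmidt: six raw numbers to a 3x3 rotation matrix (row-major).

    Assumed layout: r6[:3] is the first column, r6[3:] the second.
    """
    b1 = _normalize(r6[:3])
    d = sum(p * q for p, q in zip(b1, r6[3:]))
    b2 = _normalize([q - d * p for p, q in zip(b1, r6[3:])])
    b3 = _cross(b1, b2)  # third column completes a right-handed frame
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def matrix_to_quat_xyzw(m):
    """Rotation matrix to a unit quaternion in (x, y, z, w) order."""
    trace = m[0][0] + m[1][1] + m[2][2]
    if trace > 0.0:
        s = 2.0 * math.sqrt(trace + 1.0)  # s = 4w
        return ((m[2][1] - m[1][2]) / s, (m[0][2] - m[2][0]) / s,
                (m[1][0] - m[0][1]) / s, 0.25 * s)
    if m[0][0] > m[1][1] and m[0][0] > m[2][2]:
        s = 2.0 * math.sqrt(1.0 + m[0][0] - m[1][1] - m[2][2])  # s = 4x
        return (0.25 * s, (m[0][1] + m[1][0]) / s,
                (m[0][2] + m[2][0]) / s, (m[2][1] - m[1][2]) / s)
    if m[1][1] > m[2][2]:
        s = 2.0 * math.sqrt(1.0 + m[1][1] - m[0][0] - m[2][2])  # s = 4y
        return ((m[0][1] + m[1][0]) / s, 0.25 * s,
                (m[1][2] + m[2][1]) / s, (m[0][2] - m[2][0]) / s)
    s = 2.0 * math.sqrt(1.0 + m[2][2] - m[0][0] - m[1][1])  # s = 4z
    return ((m[0][2] + m[2][0]) / s, (m[1][2] + m[2][1]) / s,
            0.25 * s, (m[1][0] - m[0][1]) / s)


if __name__ == "__main__":
    # A 90-degree rotation about z: columns (0, 1, 0) and (-1, 0, 0).
    print(matrix_to_quat_xyzw(rot6d_to_matrix([0, 1, 0, -1, 0, 0])))
```

Because the Gram-Schmidt step normalizes and orthogonalizes its input, a regression head can emit any six numbers with non-parallel halves and still decode to a valid rotation, which is what makes the encoding convenient to train on.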
Missing a term? File a docs feedback note and we’ll add it.