Glossary

Every term used in the robotics knowledge section, defined. Sorted alphabetically. Cross-linked from every other page.

Action chunk: A short sequence of future actions ({T, action_dim}) predicted in one inference call. Lets the policy and the low-level controller run independently between calls. τ₀-WM emits 16-channel chunks (left/right EE + grippers) over a typical horizon of 16 timesteps.
ACVS (Action-Conditioned Video Simulator): The second model in the τ₀-WM family. Reuses the video backbone, removes the action branch, and conditions on a candidate action chunk to predict an imagined latent rollout plus a per-frame reward trajectory. Acts as a learned simulator at inference time.
Arm-base frame: The coordinate frame whose origin is the first non-moving link of an arm. τ₀-WM and most VLA models express end-effector poses in arm-base frames, not the world frame.
DOF (Degrees of Freedom): The number of independent dimensions in which a mechanism can move. A standard 6-DOF arm can reach any pose; a 7-DOF cobot has redundancy for elbow position; a humanoid usually has 20–30 DOF.
End-effector (EE): The business end of a manipulator — usually a gripper or tool. The EE pose (xyz + orientation) is what the policy commits to; joint angles are downstream details handled by inverse kinematics.
Episode: One continuous attempt at a task: from start to either success or failure. The unit of dataset capture.
Flow matching: A generative-model training technique. The model learns a velocity field that transports samples from a noise distribution to the data distribution along straight-line paths. Successor to denoising diffusion; what τ₀-WM uses for both its video and action heads.
Forward / inverse kinematics (FK / IK): FK maps joint angles to an EE pose (always solvable). IK does the reverse (multiple solutions, harder).
γ (gating threshold): The RCS score above which a proposed action chunk is considered commit-ready. Default 0.6. Below γ but above the hard floor, the operator must explicitly force-confirm.
Hard floor: The RCS score below which a proposed action chunk is blocked entirely. Default 0.2. The proposal is still recorded on the audit log for review.
LAR (Low-quality Action Rectification): τ₀-WM’s one-shot correction loop. When RCS lands in the gated regime, ACVS scores all N candidates’ imagined futures, the highest-peak-reward future becomes a conditioning input, and the policy is re-queried for a corrected chunk.
LeRobot: HuggingFace’s open-source robotics dataset format and Python toolkit. The de-facto standard for sharing manipulation demonstrations. Midcore captures and exports in LeRobot 0.5.1 format.
OpenPI: The open WebSocket policy protocol from Physical Intelligence. The wire contract every modern VLA implements. policy.infer(obs) returns an action chunk plus auxiliary signals (RCS, LAR applied, latency).
Policy: The function mapping observation to action. In modern manipulation that’s a learned neural network — usually a VLA — running behind an OpenPI server.
Quaternion: A four-number representation of a rotation. No gimbal lock; SLERP-interpolatable. Midcore uses the xyzw convention.
RCS (Re-denoising Consistency Score): τ₀-WM’s built-in confidence signal. Re-noises the proposed action chunk, denoises it again, and scores the consistency. High score = the candidate lies on the policy’s learned manifold; low score = it doesn’t.
RobotDefinition: Midcore’s in-memory representation of a robot — morphology (links + joints), sensors, actuators, controller hints. The unit of editing in the Designer.
6D continuous rotation: Zhou et al.’s rotation encoding for neural networks (CVPR 2019). The first two columns of the rotation matrix, flattened. Continuous everywhere; the standard choice for regression heads. τ₀-WM trains on this internally and converts to quaternion at the wire boundary.
τ₀-WM: The unified video-action world model published by Shanghai Innovation Institute and AGIBOT Finch in May 2026 under Apache 2.0. 5.5 B parameters (5 B video DiT + 0.5 B action DiT); pre-trained on 27,300 hours of mixed manipulation video. The default policy provider Midcore integrates.
UMI (Universal Manipulation Interface): A capture rig — handheld parallel gripper plus camera plus IMU — that a person uses to demonstrate manipulation tasks outside of any robot context. The recorded demonstrations transfer to robot embodiments because the action signal is gripper-centric. 6,500 hours of UMI data went into τ₀-WM’s pre-training.
URDF: Unified Robot Description Format. XML-based robot description that’s the lingua franca of the ROS ecosystem. Imports into Designer; can be exported from a Designer’s robot definition.
VAM (Video Action Model): The first half of τ₀-WM: the dual-branch DiT transformer that predicts both a future visual latent trajectory and an executable action chunk.
VLA (Vision-Language-Action): Class of models that take RGB pixels and a natural-language prompt as input, emit robot actions as output. RT-1, RT-2, π0, OpenVLA, τ₀-WM are all VLAs to varying degrees of sophistication.
Wan2.2-TI2V-5B: The 5 B-parameter text-and-image-to-video diffusion model that τ₀-WM’s video branch is cloned from. Apache 2.0, published by Wan-Video.
World model: A learned predictor of future observations, states, or rewards conditional on a candidate action. Replaces the analytically modelled simulators of classical robotics with a data-driven equivalent that handles contact, semantics, and multi-step structure.

Missing a term? File a docs feedback note and we’ll add it.