Skip to main content

Glossary

Every term used in the robotics knowledge section, defined. Sorted alphabetically. Cross-linked from every other page.

Action chunk
A short sequence of future actions ({T, action_dim}) predicted in one inference call. Lets the policy and the low-level controller run independently between calls. τ₀-WM emits 16-channel chunks (left/right EE + grippers) over a typical horizon of 16 timesteps.
ACVS (Action-Conditioned Video Simulator)
The second model in the τ₀-WM family. Reuses the video backbone, removes the action branch, and conditions on a candidate action chunk to predict an imagined latent rollout plus a per-frame reward trajectory. Acts as a learned simulator at inference time.
Arm-base frame
The coordinate frame whose origin is the first non-moving link of an arm. τ₀-WM and most VLA models express end-effector poses in arm-base frames, not the world frame.
DOF (Degrees of Freedom)
The number of independent dimensions in which a mechanism can move. A standard 6-DOF arm can reach any pose; a 7-DOF cobot has redundancy for elbow position; a humanoid usually has 20–30 DOF.
End-effector (EE)
The business end of a manipulator — usually a gripper or tool. The EE pose (xyz + orientation) is what the policy commits to; joint angles are downstream details handled by inverse kinematics.
Episode
One continuous attempt at a task: from start to either success or failure. The unit of dataset capture.
Flow matching
A generative-model training technique. The model learns a velocity field that transports samples from a noise distribution to the data distribution along straight-line paths. Successor to denoising diffusion; what τ₀-WM uses for both its video and action heads.
Forward / inverse kinematics (FK / IK)
FK maps joint angles to an EE pose (always solvable). IK does the reverse (multiple solutions, harder).
γ (gating threshold)
The RCS score above which a proposed action chunk is considered commit-ready. Default 0.6. Below γ but above the hard floor, the operator must explicitly force-confirm.
Hard floor
The RCS score below which a proposed action chunk is blocked entirely. Default 0.2. The proposal is still recorded on the audit log for review.
LAR (Low-quality Action Rectification)
τ₀-WM’s one-shot correction loop. When RCS lands in the gated regime, ACVS scores all N candidates’ imagined futures, the highest-peak-reward future becomes a conditioning input, and the policy is re-queried for a corrected chunk.
LeRobot
HuggingFace’s open-source robotics dataset format and Python toolkit. The de-facto standard for sharing manipulation demonstrations. Midcore captures and exports in LeRobot 0.5.1 format.
OpenPI
The open WebSocket policy protocol from Physical Intelligence. The wire contract every modern VLA implements. policy.infer(obs) returns an action chunk plus auxiliary signals (RCS, LAR applied, latency).
Policy
The function mapping observation to action. In modern manipulation that’s a learned neural network — usually a VLA — running behind an OpenPI server.
Quaternion
A four-number representation of a rotation. No gimbal lock; SLERP-interpolatable. Midcore uses the xyzw convention.
RCS (Re-denoising Consistency Score)
τ₀-WM’s built-in confidence signal. Re-noises the proposed action chunk, denoises it again, and scores the consistency. High score = the candidate lies on the policy’s learned manifold; low score = it doesn’t.
RobotDefinition
Midcore’s in-memory representation of a robot — morphology (links + joints), sensors, actuators, controller hints. The unit of editing in the Designer.
6D continuous rotation
Zhou et al.’s rotation encoding for neural networks (CVPR 2019). The first two columns of the rotation matrix, flattened. Continuous everywhere; the standard choice for regression heads. τ₀-WM trains on this internally and converts to quaternion at the wire boundary.
τ₀-WM
The unified video-action world model published by Shanghai Innovation Institute and AGIBOT Finch in May 2026 under Apache 2.0. 5.5 B parameters (5 B video DiT + 0.5 B action DiT); pre-trained on 27,300 hours of mixed manipulation video. The default policy provider Midcore integrates.
UMI (Universal Manipulation Interface)
A capture rig — handheld parallel gripper plus camera plus IMU — that a person uses to demonstrate manipulation tasks outside of any robot context. The recorded demonstrations transfer to robot embodiments because the action signal is gripper-centric. 6,500 hours of UMI data went into τ₀-WM’s pre-training.
URDF
Unified Robot Description Format. XML-based robot description that’s the lingua franca of the ROS ecosystem. Imports into Designer; can be exported from a Designer’s robot definition.
VAM (Video Action Model)
The first half of τ₀-WM: the dual-branch DiT transformer that predicts both a future visual latent trajectory and an executable action chunk.
VLA (Vision-Language-Action)
Class of models that take RGB pixels and a natural-language prompt as input, emit robot actions as output. RT-1, RT-2, π0, OpenVLA, τ₀-WM are all VLAs to varying degrees of sophistication.
Wan2.2-TI2V-5B
The 5 B-parameter text-and-image-to-video diffusion model that τ₀-WM’s video branch is cloned from. Apache 2.0, published by Wan-Video.
World model
A learned predictor of future observations, states, or rewards conditional on a candidate action. Replaces the analytically modelled simulators of classical robotics with a data-driven equivalent that handles contact, semantics, and multi-step structure.

Missing a term? File a docs feedback note and we’ll add it.