Datasets

The only thing that reliably improves a manipulation policy on your specific task is data from that task. This page explains the format Midcore captures into, the three data sources that work for VLA-class models, and how to author a clean episode without polluting the training corpus.

The LeRobot 0.5.1 capture format

Hugging Face’s LeRobot format is the de facto standard for robot demonstration datasets. It’s a Parquet-on-disk layout with sidecar MP4 video files and JSON metadata. Midcore writes captures into this format verbatim, so any LeRobot consumer (other VLAs, dataset registries, fine-tune frameworks) can read them without translation.

  <dataset_root>/
    meta/info.json                — schema, features, fps, splits
    meta/episodes.jsonl           — one row per episode
    meta/stats.json               — per-feature running statistics
    data/chunk-000/
      episode_000000.parquet      — per-frame state + action + meta
      episode_000001.parquet
      ...
    videos/chunk-000/
      observation.images.cam_high/
        episode_000000.mp4
      observation.images.cam_low/
        episode_000000.mp4
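
To make the layout concrete, here is a minimal Python sketch that resolves the files belonging to one episode. The helper name episode_paths is hypothetical, and the default of 1000 episodes per chunk is an assumption for illustration; the real value should be read from meta/info.json.

```python
from pathlib import PurePosixPath

def episode_paths(root, episode_index, camera_keys, chunk_size=1000):
    """Return the parquet path and per-camera video paths for one episode.

    chunk_size is an assumed value; read the real one from meta/info.json.
    """
    chunk = f"chunk-{episode_index // chunk_size:03d}"
    name = f"episode_{episode_index:06d}"
    root = PurePosixPath(root)
    return {
        "data": root / "data" / chunk / f"{name}.parquet",
        "videos": {
            key: root / "videos" / chunk / key / f"{name}.mp4"
            for key in camera_keys
        },
    }

paths = episode_paths(
    "towel-fold-bimanual",
    episode_index=1,
    camera_keys=["observation.images.cam_high", "observation.images.cam_low"],
)
print(paths["data"])  # towel-fold-bimanual/data/chunk-000/episode_000001.parquet
```

Each camera gets its own directory under videos/, so adding a camera adds one video file per episode without touching the parquet shards.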

The τ₀-WM feature specification

Each frame row in the parquet shard carries the following fields (you don’t hand-author these — the capture pipeline fills them in):

Feature	Dtype	Shape	Notes
observation.state	float32	[14]	left_ee xyz+quat + right_ee xyz+quat, arm-base frame
observation.gripper	float32	[2]	left + right opening, [0, 120]
observation.images.cam_high	uint8	[3, 192, 256]	RGB frame; the trained τ₀-WM checkpoint expects this resolution
action	float32	[16]	per arm: 6+1 channels plus one gripper-openness value on [0, 1] (2 × 8 = 16)
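
As a rough illustration of how these widths break down, the sketch below splits a 14-value observation.state into per-arm position and orientation, following the order given in the table note. The function name split_state is hypothetical, and the quaternion component order (wxyz or xyzw) is not stated in the table, so the code leaves it untouched.

```python
def split_state(state):
    """Split a 14-value observation.state into per-arm position and quaternion.

    Order follows the table note: left xyz + quat, then right xyz + quat.
    The quaternion component order is not specified, so it is passed through.
    """
    if len(state) != 14:
        raise ValueError(f"expected 14 values, got {len(state)}")
    return {
        "left": {"xyz": state[0:3], "quat": state[3:7]},
        "right": {"xyz": state[7:10], "quat": state[10:14]},
    }

# Action width implied by the table: (6 + 1 channels + 1 gripper) per arm, two arms.
ACTION_DIM = 2 * (6 + 1 + 1)

example = [float(i) for i in range(14)]
parts = split_state(example)
print(parts["right"]["xyz"])  # [7.0, 8.0, 9.0]
print(ACTION_DIM)             # 16
```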

Episode metadata

Each episode is one continuous attempt at a task. The metadata record tracks who collected it, what the natural-language task was, how long it ran, and where it came from in the source taxonomy below. The source field holds a single value such as "teleop", "umi", "human_video", or "synthetic"; the example shows a teleop episode.

  {
    "episode_index": 12,
    "tasks": ["fold towel", "place on shelf"],
    "length": 480,
    "captured_at_ms": 1717286401234,
    "source": "teleop",
    "operator": "yasser",
    "notes": "second attempt; first slip was operator error"
  }
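
A record like this is easy to sanity-check before it reaches a training run. The sketch below is one possible validator, not part of Midcore: the function name validate_episode and the specific rules (non-empty task list, positive length) are assumptions, and the allowed source values are simply the ones shown in the example record, so extend the set if your pipeline uses others.

```python
import json

# Values taken from the example record; extend if your pipeline uses others.
ALLOWED_SOURCES = {"teleop", "umi", "human_video", "synthetic"}

def validate_episode(record):
    """Return a list of problems found in one episode metadata record."""
    problems = []
    if not isinstance(record.get("episode_index"), int) or record["episode_index"] < 0:
        problems.append("episode_index must be a non-negative integer")
    tasks = record.get("tasks")
    if not isinstance(tasks, list) or not tasks or not all(isinstance(t, str) for t in tasks):
        problems.append("tasks must be a non-empty list of strings")
    if not isinstance(record.get("length"), int) or record["length"] <= 0:
        problems.append("length must be a positive frame count")
    if record.get("source") not in ALLOWED_SOURCES:
        problems.append(f"source must be one of {sorted(ALLOWED_SOURCES)}")
    return problems

# One line as it might appear in meta/episodes.jsonl.
line = '{"episode_index": 12, "tasks": ["fold towel"], "length": 480, "source": "teleop"}'
print(validate_episode(json.loads(line)))  # []

bad = {"episode_index": 3, "tasks": [], "length": 0, "source": "unknown"}
print(len(validate_episode(bad)))  # 3
```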

The three data sources that work for VLA fine-tuning

Not all robot data is equal. The τ₀-WM pre-training corpus mixes three sources because each contributes something the others can’t.

Teleoperation episodes

A human pilots the robot via VR controllers, a leader/follower arm rig, or a 3D mouse. The episode captures the human’s commanded EE trajectory plus the resulting sensor stream.

Pro: action-grounded ground truth on the actual embodiment.
Pro: high enough quality that 50–200 episodes usually move the needle on a target task.
Con: expensive. A skilled teleoperator costs roughly $50–100/hour and produces perhaps 60–120 episodes/day.
Con: the human’s jerky corrections become noise the model has to learn to ignore. Better operators produce better data.
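
The cost figures above translate into a per-episode price with a little arithmetic. The sketch assumes an 8-hour working day, which the page does not state; adjust HOURS_PER_DAY to match your operation.

```python
HOURS_PER_DAY = 8  # assumption: one operator working a standard shift

def cost_per_episode(rate_per_hour, episodes_per_day, hours_per_day=HOURS_PER_DAY):
    """Operator cost attributed to a single captured episode."""
    return rate_per_hour * hours_per_day / episodes_per_day

# Cheapest case: low rate, high throughput. Costliest: high rate, low throughput.
best = cost_per_episode(rate_per_hour=50, episodes_per_day=120)
worst = cost_per_episode(rate_per_hour=100, episodes_per_day=60)
print(f"${best:.2f} to ${worst:.2f} per episode")  # $3.33 to $13.33 per episode
```

Under that assumption, a 200-episode batch costs somewhere between about $670 and $2,670 in operator time alone, before robot and facility costs.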

UMI-style handheld captures

A person walks around with a handheld parallel gripper rig (camera + IMU + the gripper itself) and demonstrates the task outside of any robot context. UMI (Universal Manipulation Interface) is the most popular instance.

Pro: radically cheaper per episode. One person, no robot scheduling.
Pro: the demonstrations generalise across embodiments because the action signal is gripper-centric, not joint-centric.
Con: requires the handheld rig hardware. Not available off-the-shelf in every form factor.
Con: human-arm biomechanics differ from robot arms; weird configurations (e.g., overhead reach) may not transfer.

Egocentric human video

Head-mounted camera footage of humans doing tasks. No action labels — the action loss is masked out during training, but the video loss still applies. Bulk visual diversity at near-zero capture cost.

Pro: public datasets like EgoDex, Egoverse, Xperience-10M aggregate thousands of hours.
Con: no action supervision — only useful as scale data for the video branch.
Con: embodiment gap (humans aren’t robots) means careful loss masking is essential.
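
The loss masking described above can be sketched in a few lines. This is an illustration of the general idea only, not the τ₀-WM training loss: the function name combined_loss, the has_action flag, and the action_weight parameter are all hypothetical.

```python
def combined_loss(video_loss, action_loss, has_action, action_weight=1.0):
    """Average per-sample losses, dropping the action term where no label exists."""
    total = 0.0
    for v, a, labelled in zip(video_loss, action_loss, has_action):
        total += v + (action_weight * a if labelled else 0.0)
    return total / len(video_loss)

# Two teleop samples with action labels, one egocentric video sample without.
video_loss = [0.5, 0.25, 0.75]
action_loss = [0.5, 1.0, 99.0]   # last value is meaningless and must not count
has_action = [True, True, False]
print(combined_loss(video_loss, action_loss, has_action))  # 1.0
```

Every sample contributes to the video term, while the unlabelled sample contributes nothing to the action term, so a garbage action value for human video cannot leak into the gradient.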

Capture discipline that matters

Common mistakes that quietly tank a fine-tune:

Camera drift mid-capture. The model assumes the camera mount is fixed, so a bumped tripod shows up as an unexplained viewpoint shift that the model has to absorb as a latent factor.
Lighting shifts between episodes. If you capture some episodes in the morning and others in the afternoon, the model can overfit to time of day. Either keep lighting constant or capture a full diurnal sweep.
Operator switching mid-task. Two operators have different jerk profiles. Mixing them is fine; mixing them in the same episode is not.
Action-state lag. If your teleop rig has 200 ms of latency, the recorded action and state are misaligned by 6 frames at 30 fps. Always time-sync at the start of every session.
Missing failure recoveries. A corpus of only successes teaches a brittle policy. Capture the slip, the reach back, and the second attempt. Mark them with source: failure in the metadata.
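
The action-state lag figure in the list above is simple arithmetic, and the same number can be used to realign the two streams. The sketch below is a minimal illustration with hypothetical function names; it assumes the recorded state trails the commanded action, which depends on your rig.

```python
def lag_in_frames(latency_ms, fps):
    """Convert a latency in milliseconds to a whole number of frames."""
    return round(latency_ms * fps / 1000)

def realign(actions, states, lag):
    """Pair each action with the state recorded `lag` frames later.

    Assumes the recorded state trails the commanded action; swap the
    slices if your rig delays the action stream instead.
    """
    if lag == 0:
        return list(zip(actions, states))
    return list(zip(actions[:-lag], states[lag:]))

lag = lag_in_frames(latency_ms=200, fps=30)
print(lag)  # 6

actions = list(range(10))
states = list(range(100, 110))
pairs = realign(actions, states, lag)
print(pairs[0], len(pairs))  # (0, 106) 4
```

Realignment costs lag frames of usable data per episode, which is one more reason to measure and reduce latency rather than correct for it afterwards.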

Target: 100 to 500 episodes for a domain fine-tune

Empirically, a household-scale task (towel fold, dishrack loading, drawer open) needs on the order of 100–500 teleop episodes to land a useful fine-tune on top of τ₀-WM’s pre-trained weights. Industrial-scale tasks with narrow tolerances (precision insertion, fastener tightening) usually need more, plus tactile sensing the base model doesn’t support yet.
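
Combined with the teleoperation throughput quoted earlier (60–120 episodes/day), that target implies a collection window you can estimate up front. The helper name operator_days is hypothetical, and the estimate ignores discarded episodes and setup time.

```python
import math

def operator_days(target_episodes, episodes_per_day):
    """Whole operator-days needed to reach an episode target."""
    return math.ceil(target_episodes / episodes_per_day)

fastest = operator_days(target_episodes=100, episodes_per_day=120)
slowest = operator_days(target_episodes=500, episodes_per_day=60)
print(fastest, slowest)  # 1 9
```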

How Midcore records an episode

The Intake screen exposes a “LeRobot demo capture” card. Walk-through:

Pick a robot type (defaults to dual_arm_franka).
Name the dataset (e.g., towel-fold-bimanual) and set the capture FPS (default 30).
Click Create dataset. The Intake card now shows it as “recording.”
Begin teleoperation. Each captured frame is appended via the dataset’s frame endpoint.
At the end of each attempt, click End episode and tag its task description.
When you’ve captured enough episodes, click Finalize. The per-feature statistics get computed and the dataset is locked for use in Training.
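
The lifecycle in the walk-through (create, record frames, end episode, finalize and lock) can be modelled as a small state machine. The class below is a hypothetical in-memory sketch for reasoning about that flow; it is not Midcore's API, it makes no network calls, and it only computes a mean over scalar features where the real pipeline computes fuller statistics.

```python
class CaptureSession:
    """Hypothetical in-memory model of the capture lifecycle; not Midcore's API."""

    def __init__(self, name, robot_type="dual_arm_franka", fps=30):
        self.name, self.robot_type, self.fps = name, robot_type, fps
        self.episodes = []      # finished episodes: {"task": str, "frames": [...]}
        self._frames = []       # frames of the episode in progress
        self.finalized = False
        self.stats = None

    def _check_unlocked(self):
        if self.finalized:
            raise RuntimeError("dataset is finalized and locked")

    def add_frame(self, frame):
        self._check_unlocked()
        self._frames.append(frame)

    def end_episode(self, task):
        self._check_unlocked()
        if not self._frames:
            raise RuntimeError("no frames captured for this episode")
        self.episodes.append({"task": task, "frames": self._frames})
        self._frames = []

    def finalize(self):
        """Compute a per-feature mean over all scalar features, then lock."""
        sums, counts = {}, {}
        for episode in self.episodes:
            for frame in episode["frames"]:
                for key, value in frame.items():
                    sums[key] = sums.get(key, 0.0) + value
                    counts[key] = counts.get(key, 0) + 1
        self.stats = {key: {"mean": sums[key] / counts[key]} for key in sums}
        self.finalized = True

session = CaptureSession("towel-fold-bimanual")
for opening in (0.0, 0.5, 1.0):
    session.add_frame({"gripper": opening})
session.end_episode("fold towel")
session.finalize()
print(session.stats)  # {'gripper': {'mean': 0.5}}
```

Once finalize has run, further add_frame or end_episode calls raise an error, mirroring the locked state described in the last step.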

Datasets live inside your project

Each capture is stored inside the active project’s vault alongside the rest of your robotics artifacts. Project boundaries keep customers’ data isolated; the audit log records every capture and finalize event so you can prove what was recorded and when.

Ready to train?

Once the dataset is finalised, head to Training for the cost structure of a fine-tune and the practical operating procedure.