Focsle
Methodology

Inside the HIL lab: methodology for cross-vendor robotics benchmarks

What it means to call a number 'comparable' across silicon vendors — the rig, the harness, the protocol, and the ways we keep ourselves honest.

Fo’c’sle · Mar 28, 2028 · 14 min read

The first principle of cross-vendor benchmarking is that the comparison only matters if the workload actually matches what the customer is going to deploy. In practice this means: real input resolutions; real preprocessing pipelines; real downstream consumers waiting on the output. Synthetic mini-batched throughput numbers — the kind every chip vendor's own product page leads with — overstate field performance by between 1.4× and 3.2× in our matched-pair runs. Here is how we structure ours so the numbers survive contact with a customer's hardware.

Our protocol is built around three rules. First, every measurement is a matched-pair: the same input frames, the same downstream consumer, the same warm-up window, on the same physical rig. The chip changes; nothing else does. Second, every measurement is steady-state: 30 seconds of warm-up, then a 10,000-frame measurement window, with the rig instrumented at the package-power level. Third, every measurement is reproducible: the rig configuration, input data, downstream consumer code, and instrumentation harness are checked into the same git repository the published number lives in.
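The steady-state rule can be sketched as a small harness: warm up for a fixed wall-clock interval, then time a fixed frame count. This is an illustrative sketch, not the lab's actual code; `run_matched_pair`, `WARMUP_S`, and `WINDOW_FRAMES` are hypothetical names, and `infer` stands in for the chip-specific inference call that is the only thing allowed to change between runs.

```python
import time
from dataclasses import dataclass


@dataclass
class SteadyStateResult:
    frames: int
    elapsed_s: float
    fps: float


# Protocol constants from the text: 30 s warm-up, then a
# 10,000-frame measurement window.
WARMUP_S = 30.0
WINDOW_FRAMES = 10_000


def run_matched_pair(infer, frames, warmup_s=WARMUP_S, window_frames=WINDOW_FRAMES):
    """One steady-state measurement on one chip.

    `infer` is the chip-specific inference callable; `frames` yields the
    same input frames, in the same order, for every chip on the rig.
    """
    frames = iter(frames)
    # Warm-up: run inference but discard timing until the window opens.
    t0 = time.monotonic()
    while time.monotonic() - t0 < warmup_s:
        infer(next(frames))
    # Measurement window: fixed frame count, wall-clock timed.
    start = time.monotonic()
    for _ in range(window_frames):
        infer(next(frames))
    elapsed = time.monotonic() - start
    return SteadyStateResult(window_frames, elapsed, window_frames / elapsed)
```

Package-power instrumentation and the downstream consumer are omitted here; the point is only that the warm-up window and frame count are fixed by the protocol, not tuned per chip.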

Submissions from publishers go through a reproduction gate. We accept publisher-submitted benchmarks only if they reproduce within ±8% of an HIL-lab matched-pair run on the same chip in the same input mode. Failed reproductions don't get rejected — they go into a separate Discussion section on the model page, with both numbers visible and the reproduction harness alongside, so practitioners can decide for themselves which one matches their conditions.
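The ±8% gate is simple to state as code. A minimal sketch, assuming a single throughput number per run; `gate_submission` and `REPRODUCTION_TOLERANCE` are hypothetical names, not the site's actual submission schema.

```python
# Reproduction gate: a publisher-submitted number is accepted only if it
# lands within ±8% of the HIL-lab matched-pair run on the same chip in
# the same input mode; otherwise it is routed to the Discussion section.
REPRODUCTION_TOLERANCE = 0.08


def gate_submission(submitted_fps: float, hil_fps: float) -> str:
    """Return 'accepted' or 'discussion' for a publisher submission."""
    if hil_fps <= 0:
        raise ValueError("HIL reference throughput must be positive")
    relative_error = abs(submitted_fps - hil_fps) / hil_fps
    return "accepted" if relative_error <= REPRODUCTION_TOLERANCE else "discussion"
```

A submission of 108 fps against a 100 fps HIL run sits exactly on the 8% boundary and is accepted; 115 fps against the same reference goes to Discussion, with both numbers published side by side.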

We're public about the limits of this protocol. It works well for vision workloads — object detection, depth, segmentation — where the input pipeline is unambiguous and the consumer is deterministic. It's harder for VLA, where the policy's downstream behavior depends on closed-loop dynamics that you can only really observe in HIL sim. That's why every VLA model on this site has a separate Sim Results tab — the matched-pair numbers tell you about the inference cost, but only the HIL sim numbers tell you whether the policy actually works under load.

The protocol will keep evolving. We'll publish version bumps when it does, with the change history and the affected number ranges public on the methodology changelog. If you have a workload class that the current protocol doesn't fit, please send us a note — there are at least three more we know we need to design for in the next six months.

Written by Fo’c’sle — published on the Focsle changelog.