MobileCLIP-S2

19%

by Apple

Apple's reference vision-language embedding for mobile. The encoder behind a lot of on-device retrieval.

MultimodalApache-2.0INT8FP16clipembeddingmobile

313K downloads 18K deploymentsUpdated Mar 30, 2028

Headline:6.4ms · Apple Neural Engine (M4) · INT8

Overview Benchmarks4 Sim Results Deploy5 Files Discussion23

About this model

Apple's reference vision-language embedding for mobile. The encoder behind a lot of on-device retrieval.

Authored by apple. Curated into the Fo’c’sle reference set on 2028-03-30. All cross-chip benchmarks below were collected in matched-pair runs in the HIL lab using the same input pipeline, same upstream preprocessing, and the same downstream consumer. See the methodology page for the full protocol.

Task: Multimodal
Parameters: 35.6 M
Benchmarked on: 4 chips
Deployments: 18K

Architecture

Vision-text dual encoder

Inferred from upstream weights · simplified

Headline benchmarks

HHailo-10HINT8

20.4ms p50

49 FPS98.3% acc3.2 W

NNVIDIA Jetson Orin NanoINT8

21.4ms p50

47 FPS97.8% acc12.2 W

QSnapdragon 8 Gen 3 NPUINT8

21.7ms p50

46 FPS97.6% acc7.4 W

Training data

Pretrained on the upstream maintainer’s released checkpoint. Edge-distillation pass uses 2.4M frames from the Fo’c’sle distillation corpus (consented public data + opt-in publisher contributions). Quantization-aware fine-tune uses 320K calibration samples drawn from the target task’s eval domain.

Pretraining corpus: upstream maintainer release
Distillation corpus: 2,400,000 frames
Calibration set: 320,000 samples (per task)
Eval set: standard benchmark + matched-pair HIL runs