Florence-2-Edge


by Microsoft

Florence-2 distilled for edge inspection use cases — caption, detect, segment from a unified head.

Tags: Multimodal · MIT · INT8 · FP16 · vlm · unified · industrial

76K downloads · 2.8K deployments · Updated Mar 2, 2028

Headline: 84ms · NVIDIA Jetson Orin Nano · INT8


About this model


Authored by microsoft. Curated into the Fo’c’sle reference set on 2028-03-02. All cross-chip benchmarks below were collected in matched-pair runs in the HIL lab using the same input pipeline, same upstream preprocessing, and the same downstream consumer. See the methodology page for the full protocol.
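The full protocol lives on the methodology page and is not reproduced here. As a rough illustration of what a matched-pair comparison can look like, the sketch below pairs latency samples taken on two chips for the same inputs and summarizes the per-input difference. The function name `matched_pair_summary` and the sample values are hypothetical and are not measured data.

```python
from statistics import median


def matched_pair_summary(latencies_a_ms, latencies_b_ms):
    """Summarize paired latency samples (one sample per input on each chip)."""
    if len(latencies_a_ms) != len(latencies_b_ms):
        raise ValueError("matched-pair runs need one sample per input on each chip")
    # Difference is taken per input, so input-dependent variation cancels out.
    diffs = [b - a for a, b in zip(latencies_a_ms, latencies_b_ms)]
    return {
        "p50_a_ms": median(latencies_a_ms),
        "p50_b_ms": median(latencies_b_ms),
        "median_diff_ms": median(diffs),
    }


# Hypothetical samples for illustration only; not measured data.
chip_a = [30.0, 31.0, 32.0, 31.5, 30.5]
chip_b = [41.0, 42.5, 43.0, 42.0, 41.5]
summary = matched_pair_summary(chip_a, chip_b)
print(summary)
```

Pairing by input is one common design choice for this kind of comparison: it keeps differences in image content or prompt length from being mistaken for differences between chips.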

Task: Multimodal
Parameters: 230 M
Benchmarked on: 5 chips
Deployments: 2.8K

Architecture

Vision-text dual encoder

Inferred from upstream weights · simplified

Headline benchmarks

NVIDIA Jetson AGX Orin · FP16
31.3ms p50 · 32 FPS · 99.6% acc · 44.8 W

Snapdragon 8 Gen 3 NPU · INT8
42.3ms p50 · 24 FPS · 98.3% acc · 7.5 W

Qualcomm QCS8550 · INT8
58.5ms p50 · 17 FPS · 98.6% acc · 12.2 W
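The FPS figures above are consistent with single-stream throughput, that is, 1000 divided by the p50 latency in milliseconds. The sketch below checks that arithmetic and adds a rough energy-per-inference estimate. It assumes no batching or pipelining, and it assumes the listed wattage is the average draw during inference; the page confirms neither, so treat the energy numbers as a back-of-the-envelope comparison only.

```python
# Headline figures copied from the benchmark cards above.
BENCHMARKS = [
    # (chip, precision, p50 latency in ms, reported FPS, power in W)
    ("NVIDIA Jetson AGX Orin", "FP16", 31.3, 32, 44.8),
    ("Snapdragon 8 Gen 3 NPU", "INT8", 42.3, 24, 7.5),
    ("Qualcomm QCS8550", "INT8", 58.5, 17, 12.2),
]


def fps_from_latency(p50_ms: float) -> float:
    """Throughput implied by a p50 latency (assumes single-stream, no batching)."""
    return 1000.0 / p50_ms


def energy_per_inference_mj(p50_ms: float, power_w: float) -> float:
    """Rough energy per inference in millijoules: watts * milliseconds = mJ.

    Assumes power_w is the average draw while the inference runs.
    """
    return power_w * p50_ms


for chip, precision, p50_ms, reported_fps, power_w in BENCHMARKS:
    fps = fps_from_latency(p50_ms)
    energy = energy_per_inference_mj(p50_ms, power_w)
    print(
        f"{chip} ({precision}): {fps:.1f} FPS implied, "
        f"{reported_fps} reported, ~{energy:.0f} mJ per inference"
    )
```

Under those assumptions, the lowest-latency chip is not the most energy-efficient one: the AGX Orin comes out at roughly 1,400 mJ per inference, against roughly 317 mJ for the Snapdragon 8 Gen 3 NPU.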

Training data

Pretrained weights come from the upstream maintainer’s released checkpoint. The edge-distillation pass uses 2.4M frames from the Fo’c’sle distillation corpus (consented public data plus opt-in publisher contributions). The quantization-aware fine-tune uses 320K calibration samples per task, drawn from the target task’s eval domain.

Pretraining corpus: upstream maintainer release
Distillation corpus: 2,400,000 frames
Calibration set: 320,000 samples (per task)
Eval set: standard benchmark + matched-pair HIL runs
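The page does not describe how the calibration samples are used during the quantization-aware fine-tune. As a generic illustration, and not the maintainer's method, the sketch below shows symmetric per-tensor INT8 quantization: a scale is derived from the largest absolute value seen in calibration data, and values are rounded onto the INT8 grid and mapped back ("fake quantization"). The function names `int8_scale` and `fake_quantize` are hypothetical, as are the sample values.

```python
def int8_scale(calibration_values):
    """Symmetric per-tensor scale so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    # Fall back to 1.0 for an all-zero tensor to avoid dividing by zero later.
    return max_abs / 127.0 if max_abs > 0 else 1.0


def fake_quantize(x, scale):
    """Round x onto the INT8 grid, clamp to [-127, 127], and map back to float."""
    q = round(x / scale)
    q = max(-127, min(127, q))
    return q * scale


# Hypothetical activation values standing in for real calibration samples.
calibration = [-2.54, 1.0, 0.5, 0.013]
scale = int8_scale(calibration)

for value in (1.0, 0.013, 10.0):
    print(f"{value} -> {fake_quantize(value, scale):.4f}")
```

Values outside the calibrated range, such as 10.0 here, are clamped to the largest representable magnitude. That is one reason a calibration set is usually drawn from the same domain the model will see at inference time, as the card states for this model.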