Definitions

Zero-shot means executing a new task or handling a new object with no new demonstrations; the robot relies entirely on what it learned during pre-training. Few-shot means adapting to a new task with a small number of new demonstrations (roughly 5-100 in current practice). These are meaningfully different capabilities, and conflating them leads to unrealistic planning.

The Current Reality for Zero-Shot

Foundation models for robot manipulation — OpenVLA, Octo, RT-2 — have demonstrated genuine zero-shot capability on simple tasks within their training distribution. The results below have been validated and reproduced across multiple labs:

  • Open-vocabulary object detection + simple top-down grasp planning works zero-shot for ~60% of common household objects presented in upright orientation on a clear surface
  • Language-conditioned navigation in previously explored environments works zero-shot with ~75% success in structured settings
  • Pick-and-place with familiar object categories (mugs, bottles, blocks) achieves 50-65% success zero-shot with OpenVLA on standard benchmarks

Where zero-shot reliably fails: precision tasks requiring sub-5mm placement, dexterous manipulation, novel tool use, tasks where the object pose is non-canonical (tilted bottles, stacked cups), and any task involving deformable objects not well-represented in training data.

Few-Shot Fine-Tuning: The More Practical Capability

Few-shot fine-tuning (20-100 demonstrations on a new task) is where foundation models show their clearest practical value. The comparison that matters:

| Training Approach | Demos Required | Typical Success Rate | Time to Train |
|---|---|---|---|
| Foundation model, zero-shot | 0 | 30–65% (simple tasks) | N/A |
| Foundation model + 20-demo fine-tune | 20 | 70–80% | 30 min GPU |
| Foundation model + 100-demo fine-tune | 100 | 80–90% | 2 hr GPU |
| Train from scratch (ACT) | 500 | 75–88% | 3–4 hr GPU |
| Train from scratch (Diffusion) | 1,000 | 82–92% | 8–12 hr GPU |

The foundation model advantage is most pronounced in the low-data regime (under 100 demonstrations). With 20 demonstrations, a fine-tuned foundation model achieves success rates comparable to training ACT from scratch on 500 demonstrations. That is a 25× data efficiency improvement — which translates directly to cost and time savings.
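The data-efficiency arithmetic can be made concrete. A minimal sketch in Python; the per-demonstration cost is an illustrative assumption derived from the $200-500 per 50-demo figure later in this document, not a quoted rate:

```python
def data_cost(num_demos, cost_per_demo):
    """Total data-collection cost for one task, in dollars."""
    return num_demos * cost_per_demo

# Illustrative assumption: $8/demo (mid-range of the $4-10/demo implied
# by "50 demos costs $200-500" in the efficiency-curve section).
COST_PER_DEMO = 8.0

finetune_demos = 20   # fine-tuned foundation model (70-80% success)
scratch_demos = 500   # ACT from scratch at comparable success

efficiency = scratch_demos / finetune_demos
savings = data_cost(scratch_demos, COST_PER_DEMO) - data_cost(finetune_demos, COST_PER_DEMO)
print(efficiency, savings)  # 25.0 3840.0
```

The 25x ratio is the demo-count ratio; dollar savings scale linearly with whatever per-demo rate actually applies.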

Where Zero-Shot Actually Works Today

  • Structured picking with clear object detection: DETIC + GraspNet-style open-vocabulary detection + simple grasp planner works zero-shot for regular objects in organized bins. This is production-ready today for e-commerce and logistics.
  • Language-conditioned navigation in known spaces: VLN (Vision-Language Navigation) models work zero-shot in spaces they were trained to understand, with good generalization to same-layout spaces in different buildings.
  • Object recognition and sorting by category: Language-conditioned sorting ("put the red items in the left bin") works zero-shot for known categories with RGB classification.

Where It Does Not (Yet)

  • Contact-rich manipulation: Peg-in-hole, snap connectors, folding fabric, unstacking cups — zero-shot success rates are 10-30% for current foundation models. Not reliable for production.
  • Novel tool use: Using an unfamiliar tool (a can opener, a specific screwdriver) zero-shot is not yet reliable. Few-shot (20-50 demos) works.
  • Dexterous manipulation: In-hand re-grasping, rotation of objects using finger control — outside current zero-shot capability for all production models.

Foundation Model Benchmark Results (2025-2026)

| Model | Provider | Zero-Shot (SimplerEnv) | 20-Shot Fine-Tune | Parameters |
|---|---|---|---|---|
| OpenVLA | Stanford/TRI | 48-62% | 72-80% | 7B |
| Octo | Berkeley | 35-55% | 65-78% | 93M |
| pi-0 | Physical Intelligence | 55-70% | 78-88% | 3B |
| RT-2-X | Google DeepMind | 50-65% | 75-85% | 55B |
| ACT (from scratch) | Stanford | N/A | 40-55% (20 demos) | 12M |

These numbers represent performance on simple manipulation benchmarks (pick-place, drawer opening, button pressing) in controlled lab environments. Real-world deployment numbers are typically 10-20 percentage points lower due to lighting variation, background clutter, and object diversity not represented in benchmark conditions.
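When budgeting, that benchmark-to-deployment discount can be applied mechanically. A hedged helper; the 10-20 point discount is the rule of thumb stated above, not a measured constant:

```python
def real_world_estimate(bench_low, bench_high, discount=(10, 20)):
    """Shift a benchmark success range (percentage points) down by the
    deployment discount, clamping at zero. Returns (low, high)."""
    low = max(0, bench_low - discount[1])    # pessimistic: full discount
    high = max(0, bench_high - discount[0])  # optimistic: minimal discount
    return low, high

# OpenVLA's 48-62% zero-shot benchmark range maps to roughly:
print(real_world_estimate(48, 62))  # (28, 52)
```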

Few-Shot Data Efficiency Curves

The most valuable insight for practitioners is how performance scales with the number of fine-tuning demonstrations. Based on published results and SVRC's internal evaluations:

  • 5 demonstrations: Foundation models show 5-15% improvement over zero-shot. Training from scratch produces unreliable policies. The foundation model advantage is largest here -- roughly 10x more data-efficient than scratch training.
  • 20 demonstrations: The sweet spot for foundation model fine-tuning. Most models achieve 70-80% of their final performance at this point. From scratch, ACT typically reaches 40-55%. The gap between pre-trained and scratch is 20-30 percentage points.
  • 50 demonstrations: Foundation models approach their ceiling (80-88%). Scratch-trained models close the gap, reaching 60-75% with well-collected data. The cost difference is significant: 50 demos costs $200-500 at SVRC rates vs. the months of pre-training compute invested in the foundation model.
  • 100 demonstrations: Foundation model fine-tuning reaches diminishing returns for most tasks. Scratch-trained models catch up further (75-85%). The practical question becomes whether the remaining performance gap justifies the complexity of using a 3-7B parameter model for inference.
  • 500 demonstrations: Scratch-trained ACT and Diffusion Policy typically match or exceed fine-tuned foundation models. At this data volume, the advantage of pre-training is minimal for single-task deployment. Foundation models retain an advantage for multi-task generalization.
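The shrinking gap between the two curves can be tabulated directly from the ranges above. A sketch using range midpoints; the midpoints are a simplification of the quoted ranges, not additional measurements:

```python
# demos: (fine-tuned foundation model midpoint, from-scratch ACT midpoint),
# in percentage points, taken from the ranges quoted above.
CURVES = {
    20:  (75.0, 47.5),   # 70-80% vs 40-55%
    50:  (84.0, 67.5),   # 80-88% vs 60-75%
    100: (85.0, 80.0),   # ~80-90% vs 75-85%
}

def pretrain_gap(demos):
    """Advantage of pre-training over scratch training at a given demo count."""
    finetuned, scratch = CURVES[demos]
    return finetuned - scratch

print([pretrain_gap(n) for n in sorted(CURVES)])  # [27.5, 16.5, 5.0]
```

The monotonically shrinking gap (roughly 27 points at 20 demos, 5 points at 100) is the quantitative version of the crossover described at 500 demonstrations.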

When to Use Which Approach: Decision Matrix

| Scenario | Recommendation | Why |
|---|---|---|
| Quick feasibility test, simple task | Zero-shot with OpenVLA/pi-0 | No data cost; instant evaluation |
| New task, limited budget (<$2K data) | Foundation model + 20-50 demo fine-tune | Maximum performance per dollar |
| Production deployment, single task | ACT/Diffusion from scratch, 300-500 demos | Smaller model = faster inference, simpler deployment |
| Multi-task deployment (10+ tasks) | Foundation model + per-task fine-tuning | Shared backbone amortizes model cost |
| Precision task (sub-2mm tolerance) | Scratch Diffusion Policy, 500+ demos | Foundation models lack precision for tight tolerances |
| Edge compute (Jetson, no GPU server) | Small model from scratch (ACT, 12M params) | 7B VLA models cannot run on edge devices |
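The matrix can be encoded as a first-pass triage function. A sketch only: the thresholds come straight from the table, but the branch order is one reasonable precedence for overlapping scenarios, not a prescribed one:

```python
def recommend(num_tasks=1, data_budget_usd=10_000, tolerance_mm=5.0,
              edge_only=False, feasibility_only=False):
    """Map deployment constraints to the approach suggested by the matrix."""
    if feasibility_only:
        return "zero-shot (OpenVLA / pi-0)"
    if edge_only:                       # no GPU server: 7B VLAs are out
        return "small model from scratch (ACT, 12M params)"
    if tolerance_mm < 2.0:              # precision beyond foundation models
        return "Diffusion Policy from scratch, 500+ demos"
    if num_tasks >= 10:                 # shared backbone amortizes model cost
        return "foundation model + per-task fine-tuning"
    if data_budget_usd < 2_000:         # maximize performance per dollar
        return "foundation model + 20-50 demo fine-tune"
    return "ACT/Diffusion from scratch, 300-500 demos"

print(recommend(edge_only=True))
```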

The Embodiment Gap: Why Zero-Shot Is Harder Than It Looks

Zero-shot performance in NLP improves reliably with model scale: a 70B language model is consistently better at zero-shot text tasks than a 7B model. The same scaling relationship does not hold for robot foundation models, and understanding why is critical for setting realistic expectations.

The fundamental challenge is the embodiment gap: unlike text (which has a universal tokenization), robot actions are embodiment-specific. A 7-DOF joint velocity command for a Franka Research 3 is meaningless for a 6-DOF OpenArm or a mobile manipulator with a different kinematic chain. Current foundation models handle this through either action space normalization (mapping all robots to a shared 7-DOF representation, which loses information for robots with more or fewer DOF) or embodiment-specific output heads (which require at least a few demonstrations on the target embodiment to calibrate the output head).
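A minimal illustration of the information-losing normalization described above: a pad-or-truncate mapping into a shared 7-DOF space. Real systems also normalize per-dimension action statistics; this sketch omits that:

```python
SHARED_DOF = 7  # shared action representation assumed by the normalization scheme

def to_shared(action):
    """Embodiment-specific action -> shared 7-DOF vector.
    Extra DOF are dropped (information loss); missing DOF are zero-padded."""
    shared = list(action[:SHARED_DOF])
    return shared + [0.0] * (SHARED_DOF - len(shared))

def from_shared(shared, dof):
    """Shared 7-DOF vector -> command for a `dof`-joint robot."""
    return shared[:dof] + [0.0] * max(0, dof - SHARED_DOF)

# A 6-DOF arm round-trips, at the cost of a meaningless padded 7th value:
cmd6 = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1]
shared = to_shared(cmd6)
print(len(shared), from_shared(shared, 6) == cmd6)  # 7 True
```

The padding works for a 6-DOF arm; for a robot with more than 7 DOF, `to_shared` silently discards joints, which is exactly the information loss noted above.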

This means that "zero-shot" on a new robot embodiment is functionally impossible with current architectures -- you always need at least a calibration step. The zero-shot capability that works today is zero-shot to new tasks on the same embodiment that the model was trained on. Transfer to a new embodiment is always few-shot at minimum.

The perception gap is the second major challenge. Foundation models are trained on datasets collected in specific labs with specific cameras, lighting, and backgrounds. When deployed in a new environment with different visual conditions, the visual encoder's representations shift, degrading policy performance. This is why the published zero-shot numbers (collected in labs similar to the training data) are 10-20 percentage points higher than real-world numbers. Domain randomization during training and vision-language pre-training both help, but neither fully closes this gap today.
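Domain randomization, mentioned above, amounts to perturbing training images so the visual encoder cannot overfit to one lab's conditions. A toy sketch of one such perturbation, brightness jitter on a float image; real pipelines randomize many more factors (hue, backgrounds, camera pose):

```python
import random

def jitter_brightness(image, max_delta=0.2, rng=None):
    """Scale all pixel intensities by a random factor in
    [1 - max_delta, 1 + max_delta], clamping to [0, 1].
    `image` is a nested list of floats in [0, 1]."""
    rng = rng or random.Random()
    scale = 1.0 + rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px * scale)) for px in row] for row in image]

img = [[0.5, 0.9], [0.1, 1.0]]
out = jitter_brightness(img, rng=random.Random(0))
print(all(0.0 <= px <= 1.0 for row in out for px in row))  # True
```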

The action distribution gap is the third challenge. Different tasks have fundamentally different action distributions -- pick-and-place involves discrete grasp/release events, while wiping involves continuous contact maintenance. A foundation model trained primarily on pick-and-place data will have poor zero-shot performance on contact-rich tasks even if the visual understanding transfers perfectly. This is why task diversity in the pre-training data matters more than dataset size for zero-shot generalization.

Inference Cost and Latency Comparison

A factor that is often overlooked in the zero-shot vs. few-shot discussion: the inference cost of running foundation models in production. A 7B-parameter VLA model requires a dedicated GPU for inference (A10G minimum, ~$0.75/hour on cloud) and introduces 100-300ms latency per action. A 12M-parameter ACT model runs on a $200 Jetson Orin Nano at 5-15ms per action.

For deployment at scale (10+ robots), the GPU inference cost of foundation models can exceed $50,000/year -- potentially more than the cost of collecting enough data to train smaller task-specific models. This is the counterintuitive economic argument: investing in data collection (a one-time cost) can be cheaper than paying ongoing inference costs for foundation models.
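The break-even is easy to check with the figures above ($0.75/hr A10G, one dedicated GPU per robot). The `duty_cycle` knob is an added assumption for fleets that do not run around the clock:

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def annual_gpu_cost(num_robots, rate_per_hr=0.75, robots_per_gpu=1, duty_cycle=1.0):
    """Ongoing cloud-GPU inference cost for a VLA fleet, in dollars/year."""
    gpus = num_robots / robots_per_gpu
    return gpus * rate_per_hr * HOURS_PER_YEAR * duty_cycle

# 10 robots, each with a dedicated A10G running around the clock:
print(annual_gpu_cost(10))  # 65700.0 -- consistent with "can exceed $50,000/year"
```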

Practical Guidance

Plan for 200-500 demonstrations for any new task even with foundation models. Zero-shot performance is a bonus to be measured, not a baseline to be assumed. If zero-shot achieves 60%+ success on your task, consider yourself ahead of schedule. If it achieves 30%, proceed with your planned fine-tuning data collection.

The SVRC data services team can assess your specific task for zero-shot viability and recommend a realistic demonstration budget before you commit to a collection timeline.

Related Reading