Definitions

Zero-shot means executing a new task or handling a new object with no new demonstrations; the robot relies entirely on what it learned during pre-training. Few-shot means adapting to a new task with a small number of new demonstrations (roughly 5-100 in current practice). These are meaningfully different capabilities, and conflating them leads to unrealistic planning.

The Current Reality for Zero-Shot

Foundation models for robot manipulation — OpenVLA, Octo, RT-2 — have demonstrated genuine zero-shot capability on simple tasks within their training distribution. The results below have been validated and reproduced across multiple labs:

  • Open-vocabulary object detection + simple top-down grasp planning works zero-shot for ~60% of common household objects presented in upright orientation on a clear surface
  • Language-conditioned navigation in previously explored environments works zero-shot with ~75% success in structured settings
  • Pick-and-place with familiar object categories (mugs, bottles, blocks) achieves 50-65% success zero-shot with OpenVLA on standard benchmarks

Where zero-shot reliably fails: precision tasks requiring sub-5mm placement, dexterous manipulation, novel tool use, tasks where the object pose is non-canonical (tilted bottles, stacked cups), and any task involving deformable objects not well-represented in training data.

Few-Shot Fine-Tuning: The More Practical Capability

Few-shot fine-tuning (20-100 demonstrations on a new task) is where foundation models show their clearest practical value. The comparison that matters:

| Training Approach | Demos Required | Typical Success Rate | Time to Train |
|---|---|---|---|
| Foundation model, zero-shot | 0 | 30–65% (simple tasks) | N/A |
| Foundation model + 20-demo fine-tune | 20 | 70–80% | 30 min GPU |
| Foundation model + 100-demo fine-tune | 100 | 80–90% | 2 hr GPU |
| Train from scratch (ACT) | 500 | 75–88% | 3–4 hr GPU |
| Train from scratch (Diffusion) | 1,000 | 82–92% | 8–12 hr GPU |

The foundation model advantage is most pronounced in the low-data regime (under 100 demonstrations). With 20 demonstrations, a fine-tuned foundation model achieves success rates comparable to training ACT from scratch on 500 demonstrations. That is a 25× data efficiency improvement — which translates directly to cost and time savings.
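The data-efficiency arithmetic can be made concrete. A minimal sketch in Python; the per-demonstration cost is an illustrative assumption derived from the $200-500 per 50-demo figure later in this document, not a quoted rate:

```python
def data_cost(num_demos, cost_per_demo):
    """Total data-collection cost for one task, in dollars."""
    return num_demos * cost_per_demo

# Illustrative assumption: $8/demo (mid-range of the $4-10/demo implied
# by "50 demos costs $200-500" in the efficiency-curve section).
COST_PER_DEMO = 8.0

finetune_demos = 20   # fine-tuned foundation model (70-80% success)
scratch_demos = 500   # ACT from scratch at comparable success

efficiency = scratch_demos / finetune_demos
savings = data_cost(scratch_demos, COST_PER_DEMO) - data_cost(finetune_demos, COST_PER_DEMO)
print(efficiency, savings)  # 25.0 3840.0
```

The 25x ratio is the demo-count ratio; dollar savings scale linearly with whatever per-demo rate actually applies.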

Where Zero-Shot Actually Works Today

  • Structured picking with clear object detection: DETIC + GraspNet-style open-vocabulary detection + simple grasp planner works zero-shot for regular objects in organized bins. This is production-ready today for e-commerce and logistics.
  • Language-conditioned navigation in known spaces: VLN (Vision-Language Navigation) models work zero-shot in spaces they were trained to understand, with good generalization to same-layout spaces in different buildings.
  • Object recognition and sorting by category: Language-conditioned sorting ("put the red items in the left bin") works zero-shot for known categories with RGB classification.

Where It Does Not (Yet)

  • Contact-rich manipulation: Peg-in-hole, snap connectors, folding fabric, unstacking cups — zero-shot success rates are 10-30% for current foundation models. Not reliable for production.
  • Novel tool use: Using an unfamiliar tool (a can opener, a specific screwdriver) zero-shot is not yet reliable. Few-shot (20-50 demos) works.
  • Dexterous manipulation: In-hand re-grasping, rotation of objects using finger control — outside current zero-shot capability for all production models.

Foundation Model Benchmark Results (2025-2026)

| Model | Provider | Zero-Shot (SimplerEnv) | 20-Shot Fine-Tune | Parameters |
|---|---|---|---|---|
| OpenVLA | Stanford/TRI | 48-62% | 72-80% | 7B |
| Octo | Berkeley | 35-55% | 65-78% | 93M |
| pi-0 | Physical Intelligence | 55-70% | 78-88% | 3B |
| RT-2-X | Google DeepMind | 50-65% | 75-85% | 55B |
| ACT (from scratch) | Stanford | N/A | 40-55% (20 demos) | 12M |

These numbers represent performance on simple manipulation benchmarks (pick-place, drawer opening, button pressing) in controlled lab environments. Real-world deployment numbers are typically 10-20 percentage points lower due to lighting variation, background clutter, and object diversity not represented in benchmark conditions.
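When budgeting, that benchmark-to-deployment discount can be applied mechanically. A hedged helper; the 10-20 point discount is the rule of thumb stated above, not a measured constant:

```python
def real_world_estimate(bench_low, bench_high, discount=(10, 20)):
    """Shift a benchmark success range (percentage points) down by the
    deployment discount, clamping at zero. Returns (low, high)."""
    low = max(0, bench_low - discount[1])    # pessimistic: full discount
    high = max(0, bench_high - discount[0])  # optimistic: minimal discount
    return low, high

# OpenVLA's 48-62% zero-shot benchmark range maps to roughly:
print(real_world_estimate(48, 62))  # (28, 52)
```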

Few-Shot Data Efficiency Curves

The most valuable insight for practitioners is how performance scales with the number of fine-tuning demonstrations. Based on published results and SVRC's internal evaluations:

  • 5 demonstrations: Foundation models show 5-15% improvement over zero-shot. Training from scratch produces unreliable policies. The foundation model advantage is largest here -- roughly 10x more data-efficient than scratch training.
  • 20 demonstrations: The sweet spot for foundation model fine-tuning. Most models achieve 70-80% of their final performance at this point. From scratch, ACT typically reaches 40-55%. The gap between pre-trained and scratch is 20-30 percentage points.
  • 50 demonstrations: Foundation models approach their ceiling (80-88%). Scratch-trained models close the gap, reaching 60-75% with well-collected data. The cost difference is significant: 50 demos costs $200-500 at SVRC rates vs. the months of pre-training compute invested in the foundation model.
  • 100 demonstrations: Foundation model fine-tuning reaches diminishing returns for most tasks. Scratch-trained models catch up further (75-85%). The practical question becomes whether the remaining performance gap justifies the complexity of using a 3-7B parameter model for inference.
  • 500 demonstrations: Scratch-trained ACT and Diffusion Policy typically match or exceed fine-tuned foundation models. At this data volume, the advantage of pre-training is minimal for single-task deployment. Foundation models retain an advantage for multi-task generalization.
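The shrinking gap between the two curves can be tabulated directly from the ranges above. A sketch using range midpoints; the midpoints are a simplification of the quoted ranges, not additional measurements:

```python
# demos: (fine-tuned foundation model midpoint, from-scratch ACT midpoint),
# in percentage points, taken from the ranges quoted above.
CURVES = {
    20:  (75.0, 47.5),   # 70-80% vs 40-55%
    50:  (84.0, 67.5),   # 80-88% vs 60-75%
    100: (85.0, 80.0),   # ~80-90% vs 75-85%
}

def pretrain_gap(demos):
    """Advantage of pre-training over scratch training at a given demo count."""
    finetuned, scratch = CURVES[demos]
    return finetuned - scratch

print([pretrain_gap(n) for n in sorted(CURVES)])  # [27.5, 16.5, 5.0]
```

The monotonically shrinking gap (roughly 27 points at 20 demos, 5 points at 100) is the quantitative version of the crossover described at 500 demonstrations.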

When to Use Which Approach: Decision Matrix

| Scenario | Recommendation | Why |
|---|---|---|
| Quick feasibility test, simple task | Zero-shot with OpenVLA/pi-0 | No data cost; instant evaluation |
| New task, limited budget (<$2K data) | Foundation model + 20-50 demo fine-tune | Maximum performance per dollar |
| Production deployment, single task | ACT/Diffusion from scratch, 300-500 demos | Smaller model = faster inference, simpler deployment |
| Multi-task deployment (10+ tasks) | Foundation model + per-task fine-tuning | Shared backbone amortizes model cost |
| Precision task (sub-2mm tolerance) | Scratch Diffusion Policy, 500+ demos | Foundation models lack precision for tight tolerances |
| Edge compute (Jetson, no GPU server) | Small model from scratch (ACT, 12M params) | 7B VLA models cannot run on edge devices |
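The matrix can be encoded as a first-pass triage function. A sketch only: the thresholds come straight from the table, but the branch order is one reasonable precedence for overlapping scenarios, not a prescribed one:

```python
def recommend(num_tasks=1, data_budget_usd=10_000, tolerance_mm=5.0,
              edge_only=False, feasibility_only=False):
    """Map deployment constraints to the approach suggested by the matrix."""
    if feasibility_only:
        return "zero-shot (OpenVLA / pi-0)"
    if edge_only:                       # no GPU server: 7B VLAs are out
        return "small model from scratch (ACT, 12M params)"
    if tolerance_mm < 2.0:              # precision beyond foundation models
        return "Diffusion Policy from scratch, 500+ demos"
    if num_tasks >= 10:                 # shared backbone amortizes model cost
        return "foundation model + per-task fine-tuning"
    if data_budget_usd < 2_000:         # maximize performance per dollar
        return "foundation model + 20-50 demo fine-tune"
    return "ACT/Diffusion from scratch, 300-500 demos"

print(recommend(edge_only=True))
```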

The Embodiment Gap: Why Zero-Shot Is Harder Than It Looks

Zero-shot performance in NLP improves reliably with model scale: a 70B language model is consistently better at zero-shot text tasks than a 7B model. The same scaling relationship does not hold for robot foundation models, and understanding why is critical for setting realistic expectations.

The fundamental challenge is the embodiment gap: unlike text (which has a universal tokenization), robot actions are embodiment-specific. A 7-DOF joint velocity command for a Franka Research 3 is meaningless for a 6-DOF OpenArm or a mobile manipulator with a different kinematic chain. Current foundation models handle this through either action space normalization (mapping all robots to a shared 7-DOF representation, which loses information for robots with more or fewer DOF) or embodiment-specific output heads (which require at least a few demonstrations on the target embodiment to calibrate the output head).
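A minimal illustration of the information-losing normalization described above: a pad-or-truncate mapping into a shared 7-DOF space. Real systems also normalize per-dimension action statistics; this sketch omits that:

```python
SHARED_DOF = 7  # shared action representation assumed by the normalization scheme

def to_shared(action):
    """Embodiment-specific action -> shared 7-DOF vector.
    Extra DOF are dropped (information loss); missing DOF are zero-padded."""
    shared = list(action[:SHARED_DOF])
    return shared + [0.0] * (SHARED_DOF - len(shared))

def from_shared(shared, dof):
    """Shared 7-DOF vector -> command for a `dof`-joint robot."""
    return shared[:dof] + [0.0] * max(0, dof - SHARED_DOF)

# A 6-DOF arm round-trips, at the cost of a meaningless padded 7th value:
cmd6 = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1]
shared = to_shared(cmd6)
print(len(shared), from_shared(shared, 6) == cmd6)  # 7 True
```

The padding works for a 6-DOF arm; for a robot with more than 7 DOF, `to_shared` silently discards joints, which is exactly the information loss noted above.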

This means that "zero-shot" on a new robot embodiment is functionally impossible with current architectures -- you always need at least a calibration step. The zero-shot capability that works today is zero-shot to new tasks on the same embodiment that the model was trained on. Transfer to a new embodiment is always few-shot at minimum.

The perception gap is the second major challenge. Foundation models are trained on datasets collected in specific labs with specific cameras, lighting, and backgrounds. When deployed in a new environment with different visual conditions, the visual encoder's representations shift, degrading policy performance. This is why the published zero-shot numbers (collected in labs similar to the training data) are 10-20 percentage points higher than real-world numbers. Domain randomization during training and vision-language pre-training both help, but neither fully closes this gap today.
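Domain randomization, mentioned above, amounts to perturbing training images so the visual encoder cannot overfit to one lab's conditions. A toy sketch of one such perturbation, brightness jitter on a float image; real pipelines randomize many more factors (hue, backgrounds, camera pose):

```python
import random

def jitter_brightness(image, max_delta=0.2, rng=None):
    """Scale all pixel intensities by a random factor in
    [1 - max_delta, 1 + max_delta], clamping to [0, 1].
    `image` is a nested list of floats in [0, 1]."""
    rng = rng or random.Random()
    scale = 1.0 + rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px * scale)) for px in row] for row in image]

img = [[0.5, 0.9], [0.1, 1.0]]
out = jitter_brightness(img, rng=random.Random(0))
print(all(0.0 <= px <= 1.0 for row in out for px in row))  # True
```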

The action distribution gap is the third challenge. Different tasks have fundamentally different action distributions -- pick-and-place involves discrete grasp/release events, while wiping involves continuous contact maintenance. A foundation model trained primarily on pick-and-place data will have poor zero-shot performance on contact-rich tasks even if the visual understanding transfers perfectly. This is why task diversity in the pre-training data matters more than dataset size for zero-shot generalization.

Inference Cost and Latency Comparison

A factor that is often overlooked in the zero-shot vs. few-shot discussion: the inference cost of running foundation models in production. A 7B-parameter VLA model requires a dedicated GPU for inference (A10G minimum, ~$0.75/hour on cloud) and introduces 100-300ms latency per action. A 12M-parameter ACT model runs on a $200 Jetson Orin Nano at 5-15ms per action.

For deployment at scale (10+ robots), the GPU inference cost of foundation models can exceed $50,000/year -- potentially more than the cost of collecting enough data to train smaller task-specific models. This is the counterintuitive economic argument: investing in data collection (a one-time cost) can be cheaper than paying ongoing inference costs for foundation models.
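The break-even is easy to check with the figures above ($0.75/hr A10G, one dedicated GPU per robot). The `duty_cycle` knob is an added assumption for fleets that do not run around the clock:

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def annual_gpu_cost(num_robots, rate_per_hr=0.75, robots_per_gpu=1, duty_cycle=1.0):
    """Ongoing cloud-GPU inference cost for a VLA fleet, in dollars/year."""
    gpus = num_robots / robots_per_gpu
    return gpus * rate_per_hr * HOURS_PER_YEAR * duty_cycle

# 10 robots, each with a dedicated A10G running around the clock:
print(annual_gpu_cost(10))  # 65700.0 -- consistent with "can exceed $50,000/year"
```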

Practical Guidance

Plan for 200-500 demonstrations for any new task even with foundation models. Zero-shot performance is a bonus to be measured, not a baseline to be assumed. If zero-shot achieves 60%+ success on your task, consider yourself ahead of schedule. If it achieves 30%, proceed with your planned fine-tuning data collection.

The SVRC data services team can assess your specific task for zero-shot viability and recommend a realistic demonstration budget before you commit to a collection timeline.

Related Reading