Confidence Intervals for Object Detection: Why mAP Alone Isn't Enough

Computer vision teams ship models based on numbers like mAP@0.5:0.95 of 0.49, precision of 0.87, or recall of 0.91. These metrics appear in model cards, benchmark tables, and deployment decisions. They look definitive.
But a mAP score computed on a validation set is not a fixed property of your model. It is an estimate derived from a finite sample of images, a specific set of annotations, and a chain of evaluation choices including IoU thresholds, confidence cutoffs, and class definitions. The same model evaluated on a slightly different validation split would produce a different mAP. Sometimes the difference is small. Sometimes it changes the decision about which model to deploy.
Object detection evaluation has unique challenges that make uncertainty quantification even more critical than in classification or regression. This post explores why confidence intervals matter specifically for computer vision, and how Infer makes uncertainty-aware detection evaluation practical.
Why object detection metrics are especially uncertain
Object detection metrics like mAP are more complex and more fragile than classification accuracy. Understanding where uncertainty comes from helps explain why single-number evaluations are particularly misleading in computer vision.
Multiple objects per image compound variance
In classification, each image contributes one prediction. In detection, a single image might contain dozens of objects across multiple classes. This creates two sources of variability:
- Prediction count variance: The number of detected objects varies by image, confidence threshold, and NMS settings
- Per-object matching variance: Each ground truth box may or may not be matched, depending on IoU overlap and prediction ordering
A validation set of 1,000 images might contain 50,000 ground truth boxes. But those boxes are clustered within images, not independent samples. This clustering inflates variance compared to what naive sample size calculations would suggest.
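To make the clustering effect concrete, here is a minimal sketch using the Kish design-effect approximation, n_eff = n / (1 + (m - 1) * icc). The intra-image correlation values (`icc`) are hypothetical and chosen only for illustration; the post does not report them, and the formula assumes roughly equal numbers of boxes per image.

```python
def effective_sample_size(n_boxes: int, boxes_per_image: float, icc: float) -> float:
    """Kish design-effect approximation for clustered samples.

    icc is the intra-image correlation between box-level outcomes
    (0 = boxes behave independently, 1 = boxes in an image always agree).
    """
    design_effect = 1.0 + (boxes_per_image - 1.0) * icc
    return n_boxes / design_effect


# 50,000 boxes over 1,000 images is 50 boxes per image on average.
# The icc values below are illustrative assumptions, not measurements.
for icc in (0.0, 0.05, 0.2):
    n_eff = effective_sample_size(50_000, 50, icc)
    print(f"icc={icc:.2f}  effective sample size ~ {n_eff:,.0f}")
```

Even a modest correlation shrinks the effective sample size to a fraction of the raw box count, which is why box counts alone overstate how much evidence a validation set contains.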
IoU thresholds introduce discontinuities
Detection metrics depend on whether a predicted box overlaps sufficiently with ground truth. At IoU threshold 0.5, a prediction with 0.49 overlap is a false positive. At 0.51, the same prediction is a true positive.
This hard threshold creates a discontinuous relationship between box coordinates and metric values. Small annotation noise or model jitter can flip individual detections across the threshold boundary, causing metric instability that has nothing to do with actual model quality.
COCO-style mAP averages across 10 IoU thresholds from 0.5 to 0.95, which partially smooths this effect. But threshold sensitivity remains a fundamental source of evaluation uncertainty.
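The threshold flip is easy to reproduce. The sketch below assumes boxes in `(x1, y1, x2, y2)` format and uses made-up coordinates: a 100x100 ground truth box and two predictions that differ by a one-pixel horizontal shift.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 100, 100)
pred_shift_33 = (33, 0, 133, 100)  # IoU = 67/133, just above 0.5
pred_shift_34 = (34, 0, 134, 100)  # IoU = 66/134, just below 0.5

for name, pred in [("shift 33px", pred_shift_33), ("shift 34px", pred_shift_34)]:
    overlap = iou(ground_truth, pred)
    label = "true positive" if overlap >= 0.5 else "false positive"
    print(f"{name}: IoU={overlap:.4f} -> {label} at threshold 0.5")
```

A single pixel of jitter changes the outcome at threshold 0.5, even though the two predictions are practically identical.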
Class imbalance is often extreme
Object detection datasets frequently have severe class imbalance. COCO has 80 classes, but "person" appears in far more images than "toaster" or "hair dryer". Some classes may have only a few dozen instances in the entire validation set.
For rare classes, per-class AP is computed from a handful of predictions. The variance of these estimates is enormous, but this instability is hidden when metrics are averaged across classes.
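As a rough illustration of how instance count drives variance, the sketch below computes the binomial standard error of a recall estimate. This is a simplification: it treats ground truth boxes as independent (which the clustering discussion above shows is optimistic) and uses recall rather than AP, whose variance has no simple closed form. The recall value of 0.7 is an arbitrary example.

```python
import math


def recall_standard_error(recall: float, n_ground_truth: int) -> float:
    """Binomial standard error of a recall estimate, assuming independent boxes."""
    return math.sqrt(recall * (1.0 - recall) / n_ground_truth)


# Same underlying recall, very different certainty depending on instance count.
for n in (30, 300, 3000):
    se = recall_standard_error(0.7, n)
    print(f"n={n:>5}  SE={se:.3f}  rough 95% half-width={1.96 * se:.3f}")
```

With a few dozen instances, the rough 95% half-width is on the order of 0.15, so a per-class number reported to two decimal places carries far less information than it appears to.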
Annotation quality varies
Ground truth annotations for detection are inherently noisy. Different annotators draw bounding boxes differently. Some objects are ambiguous or partially occluded. Annotation guidelines evolve over time.
This label noise propagates directly into metric calculations. A model that matches one annotator's style might show higher mAP than one that matches another's, even if both models have equivalent real-world performance.
What confidence intervals reveal about detection performance
Confidence intervals transform mAP and related metrics from false certainties into honest statements about what the evaluation data actually supports.
Distinguishing real improvements from noise
Consider two YOLO models evaluated on the same validation set:
- Model A: mAP@0.5:0.95 = 0.491
- Model B: mAP@0.5:0.95 = 0.502
Is Model B actually better? Without confidence intervals, the answer seems obvious. But with uncertainty quantification:
- Model A: mAP = 0.491, 95% CI [0.439, 0.536]
- Model B: mAP = 0.502, 95% CI [0.451, 0.548]
The confidence intervals overlap substantially. The observed difference of 0.011 is well within the range of sampling variability. Selecting Model B based on this comparison would be selecting based on noise.
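The overlap can be checked mechanically. The helper names below are hypothetical, written for this example only; they are not part of any library.

```python
def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def overlap_width(ci_a, ci_b):
    """Length of the shared region, or 0.0 if the intervals are disjoint."""
    return max(0.0, min(ci_a[1], ci_b[1]) - max(ci_a[0], ci_b[0]))


model_a = (0.439, 0.536)
model_b = (0.451, 0.548)

print(intervals_overlap(model_a, model_b))
print(round(overlap_width(model_a, model_b), 3))  # shared region is 0.451 to 0.536
```

The shared region is 0.085 wide, nearly eight times the observed difference of 0.011. Interval overlap is a coarse check; a paired comparison on the same images is more sensitive.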
Understanding worst-case performance
A point estimate of mAP = 0.87 sounds production-ready. But the lower bound of a confidence interval tells a different story.
If the 95% CI is [0.82, 0.91], the evaluation data is consistent with a true mAP as low as 0.82 on data drawn from the same distribution as the validation set. Whether that lower bound is acceptable depends on the application. A self-driving car perception system has different requirements than a retail inventory scanner.
Confidence intervals make this risk assessment explicit rather than leaving it to implicit assumptions about metric stability.
Exposing fragile per-class performance
Aggregate mAP hides per-class variation. A model might achieve mAP = 0.75 overall, but with vastly different reliability across classes:
| Class | AP | 95% CI |
|---|---|---|
| person | 0.89 | [0.86, 0.92] |
| car | 0.81 | [0.74, 0.87] |
| bicycle | 0.72 | [0.58, 0.84] |
| stop sign | 0.65 | [0.41, 0.85] |
The "stop sign" class has a CI width of 0.44, spanning from poor to excellent performance. This class has high uncertainty due to limited validation samples. For safety-critical applications, this instability must be addressed before deployment.
How Infer computes detection confidence intervals
Infer provides confidence intervals for object detection metrics through image-level bootstrap resampling. This approach is statistically principled and computationally efficient.
Bootstrap at the image level
Detection predictions within an image are not independent. A false positive on one object might cause a ground truth box to remain unmatched, affecting precision and recall calculations for other objects in the same image.
Infer respects this structure by resampling at the image level. Each bootstrap iteration:
- Samples images with replacement from the validation set
- Includes all predictions and ground truth boxes for each sampled image
- Recomputes mAP, precision, and recall on the resampled data
- Repeats thousands of times to build a distribution
This approach correctly captures the correlation structure within images while measuring variability across images.
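The steps above can be sketched in a few lines of plain Python. This is not Infer's implementation: it is a minimal illustration that assumes each image has already been reduced to hypothetical `tp`/`fp`/`fn` counts, uses precision as the metric instead of mAP, and runs on synthetic data.

```python
import random


def precision(images):
    """Precision pooled over a list of per-image count dicts."""
    tp = sum(img["tp"] for img in images)
    fp = sum(img["fp"] for img in images)
    return tp / (tp + fp) if tp + fp else 0.0


def image_level_bootstrap(images, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI obtained by resampling whole images with replacement."""
    rng = random.Random(seed)
    n = len(images)
    stats = []
    for _ in range(n_resamples):
        # Each draw keeps every box and prediction of the sampled image together.
        resample = [images[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_resamples - 1))]
    hi = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Synthetic per-image results (illustrative only, not real model output).
data_rng = random.Random(42)
images = []
for _ in range(200):
    n_gt = data_rng.randint(1, 20)
    tp = sum(data_rng.random() < 0.8 for _ in range(n_gt))
    images.append({"tp": tp, "fp": data_rng.randint(0, 3), "fn": n_gt - tp})

point = precision(images)
low, high = image_level_bootstrap(images, precision)
print(f"precision = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Because whole images are the resampling unit, any dependence between boxes in the same image is carried into every bootstrap replicate rather than being broken up.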
No re-inference required
A critical practical advantage: Infer reuses cached predictions. You run inference once on your validation set, then bootstrap resampling operates on those stored results.
This makes confidence interval computation fast. Computing 1,000 bootstrap iterations adds seconds to minutes, not hours of GPU time.
COCO-standard metric calculation
Infer implements COCO-standard AP calculation exactly:
- IoU thresholds: 0.5:0.05:0.95 (10 thresholds)
- Predictions sorted by confidence score
- Greedy matching of predictions to ground truth
- Area under precision-recall curve per class
- Mean across classes and IoU thresholds
The point estimates match ultralytics and official COCO evaluation. The confidence intervals are computed around the same metric definitions practitioners already use.
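For readers who want to see the moving parts, here is a simplified single-class sketch of the steps listed above. It is not Infer's code and not a full COCO evaluator: it omits crowd regions, area ranges, and per-image detection limits. The data layout (`preds` as `(image_id, score, box)` tuples, `gts` as a dict of boxes per image) is an assumption made for this example.

```python
def iou(a, b):
    """Intersection over union for (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """AP for one class at one IoU threshold, 101-point interpolation."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    hits = []
    # Greedy matching in descending confidence order.
    for image_id, _score, box in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(gts.get(image_id, [])):
            if not matched[image_id][j]:
                overlap = iou(box, gt_box)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[image_id][best_j] = True
            hits.append(1)
        else:
            hits.append(0)
    # Cumulative precision and recall along the ranked list.
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)
    # Make precision monotonically non-increasing (the PR envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sample the envelope at recall = 0.00, 0.01, ..., 1.00.
    total = 0.0
    for t in range(101):
        r = t / 100
        total += next((p for p, rc in zip(precisions, recalls) if rc >= r), 0.0)
    return total / 101


COCO_IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def mean_ap_over_iou(preds, gts):
    """Average single-class AP across the ten COCO IoU thresholds."""
    return sum(average_precision(preds, gts, t) for t in COCO_IOU_THRESHOLDS) / 10


# Toy data: one exact match and one miss, two ground truth boxes in total.
gts = {"img1": [(0, 0, 10, 10)], "img2": [(0, 0, 10, 10)]}
preds = [("img1", 0.9, (0, 0, 10, 10)), ("img2", 0.8, (50, 50, 60, 60))]
print(round(average_precision(preds, gts), 4))
```

In the toy data, precision is 1.0 up to recall 0.5 and undefined beyond it, so 51 of the 101 recall points contribute, giving an AP of 51/101. A full mAP would also average this quantity across classes.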
Practical considerations
Choosing the number of resamples
More bootstrap iterations produce more stable CI estimates. Recommended values:
- Quick exploration: 100-500 resamples
- Standard evaluation: 1,000 resamples
- Publication/deployment: 5,000-10,000 resamples
Computation time scales linearly with the number of resamples. For a 1,000-image validation set, 1,000 resamples typically complete in under a minute.
Bootstrap method selection
Infer supports three bootstrap methods:
- bootstrap_percentile: Fast and robust, recommended for most use cases
- bootstrap_bca: Bias-corrected and accelerated, more accurate but slower
- bootstrap_basic: Simple baseline method
For detection metrics, bootstrap_percentile is usually sufficient. The BCA method provides marginal accuracy gains at 2-3x computational cost.
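To show how the two simpler methods differ, the sketch below builds percentile and basic intervals from the same set of bootstrap statistics. It follows the textbook definitions and is not taken from Infer's source; the function names are hypothetical. BCa is omitted because it additionally needs jackknife estimates for the acceleration term.

```python
def percentile_interval(boot_stats, alpha=0.05):
    """Read the CI directly off the quantiles of the bootstrap distribution."""
    s = sorted(boot_stats)
    lo = s[int((alpha / 2) * (len(s) - 1))]
    hi = s[int((1 - alpha / 2) * (len(s) - 1))]
    return lo, hi


def basic_interval(boot_stats, point_estimate, alpha=0.05):
    """Reflect the bootstrap quantiles around the point estimate."""
    lo, hi = percentile_interval(boot_stats, alpha)
    return 2 * point_estimate - hi, 2 * point_estimate - lo


# A deliberately simple stand-in for a bootstrap distribution.
boot_stats = [i / 100 for i in range(101)]
print(percentile_interval(boot_stats))
print(basic_interval(boot_stats, point_estimate=0.5))
```

When the bootstrap distribution is symmetric around the point estimate, the two intervals nearly coincide; they diverge when the distribution is skewed, which is the situation BCa is designed to correct.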
Interpreting CI width
CI width indicates metric reliability:
- Narrow CI (< 0.05): Metric is stable, safe to compare across models
- Moderate CI (0.05-0.10): Normal variability, interpret differences carefully
- Wide CI (> 0.10): High uncertainty; either more validation data is needed or the metric itself is unstable
Wide CIs often indicate:
- Small validation set
- Rare classes dominating the metric
- High variance in model predictions
Comparing models statistically
To determine if Model A is significantly better than Model B:
- Compute CIs for both models on the same validation set
- Check for CI overlap
- If CIs do not overlap, the difference is statistically significant at the chosen confidence level
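Overlapping CIs do not settle the question. Two intervals can overlap even when the difference is statistically significant, particularly when both models are scored on the same images and their errors are correlated. Consider: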
- Increasing the validation set size
- Using paired bootstrap tests
- Accepting that the available data cannot distinguish the models
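A paired bootstrap resamples the same images for both models and builds a CI on the difference directly. The sketch below is a simplified illustration, not Infer's API: it assumes a hypothetical per-image score for each model and uses the mean as the metric. For mAP, which is not a mean of per-image scores, the same index list would be used to recompute mAP for both models on each resample.

```python
import random


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(B) - mean(A) using shared image resamples.

    scores_a and scores_b must be aligned: index i refers to the same image.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same images")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # One index draw is applied to both models, preserving the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        diffs.append(mean_b - mean_a)
    diffs.sort()
    lo = diffs[int((alpha / 2) * (n_resamples - 1))]
    hi = diffs[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Synthetic per-image scores (illustrative only): B tracks A plus a small gain.
data_rng = random.Random(1)
scores_a = [data_rng.uniform(0.3, 0.9) for _ in range(300)]
scores_b = [s + data_rng.gauss(0.01, 0.02) for s in scores_a]

low, high = paired_bootstrap_diff(scores_a, scores_b)
print(f"difference B - A: 95% CI [{low:.4f}, {high:.4f}]")
```

If the interval for the difference excludes zero, the improvement is distinguishable from sampling noise at the chosen level, even when the two models' individual CIs overlap.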
Real-world impact
Case study: autonomous vehicle perception
An AV team evaluates two pedestrian detection models:
| Model | mAP | 95% CI |
|---|---|---|
| Current | 0.847 | [0.821, 0.871] |
| Candidate | 0.859 | [0.834, 0.882] |
The candidate model shows higher point mAP, but the CIs overlap. The team investigates per-class performance:
| Class | Current CI | Candidate CI |
|---|---|---|
| Adult pedestrian | [0.88, 0.92] | [0.89, 0.93] |
| Child pedestrian | [0.71, 0.84] | [0.68, 0.81] |
| Wheelchair user | [0.52, 0.78] | [0.61, 0.83] |
The candidate's interval for wheelchair users sits higher and its interval for child pedestrians sits slightly lower, though both pairs of intervals overlap heavily. The CIs expose a possible tradeoff that aggregate mAP hid, along with how uncertain the rare-class estimates are. The team decides to collect more validation data for rare pedestrian types before making a deployment decision.
Case study: retail inventory
A retail company compares detection models for shelf monitoring:
| Model | mAP | 95% CI | Inference time |
|---|---|---|---|
| YOLOv8n | 0.72 | [0.68, 0.76] | 5ms |
| YOLOv8s | 0.78 | [0.74, 0.82] | 12ms |
| YOLOv8m | 0.81 | [0.77, 0.85] | 28ms |
The confidence intervals reveal:
- YOLOv8n and YOLOv8s: the CIs overlap only narrowly (0.74 to 0.76), so the improvement is probably real
- YOLOv8s and YOLOv8m: the CIs overlap substantially (0.77 to 0.82), so the improvement may be noise
Given that YOLOv8m takes more than twice as long per inference as YOLOv8s (28ms vs. 12ms), the team selects YOLOv8s. The evidence for the larger model's accuracy advantage is too weak to justify the latency cost.
Beyond point estimates
Object detection evaluation is not fundamentally about computing a number. It is about understanding whether a model meets requirements with sufficient reliability for deployment.
Single mAP values cannot answer questions like:
- Is this accuracy improvement real or noise?
- How might performance vary on different validation samples?
- Which classes have reliable detection and which are fragile?
- What is the worst-case performance we should plan for?
Confidence intervals provide the statistical foundation to answer these questions honestly.
Conclusion
Object detection metrics are estimates, not facts. They depend on finite validation data, noisy annotations, and arbitrary evaluation choices. Reporting mAP = 0.85 without uncertainty is claiming a precision that the evaluation process cannot support.
Confidence intervals transform detection evaluation from a ritual of computing numbers into a principled assessment of model reliability. They expose when improvements are real, when comparisons are inconclusive, and when per-class performance is too uncertain for safe deployment.
Infer makes this uncertainty quantification practical for computer vision workflows. It integrates directly with YOLO and standard detection pipelines, computes COCO-standard metrics with statistically appropriate confidence intervals, and visualizes the uncertainty that single numbers hide.
In domains where detection failures have consequences, from autonomous vehicles to medical imaging to security systems, treating evaluation uncertainty as optional is not defensible. Infer makes it measurable, visible, and actionable.
Install Infer: pip install infer-ci
GitHub: https://github.com/humblebeeai/infer-ci
Documentation: https://infer.humblebee.ai