Confidence Intervals for Object Detection: Why mAP Alone Isn't Enough

Computer vision teams ship models based on numbers like mAP@0.5:0.95 of 0.49, precision of 0.87, or recall of 0.91. These metrics appear in model cards, benchmark tables, and deployment decisions. They look definitive.

But a mAP score computed on a validation set is not a fixed property of your model. It is an estimate derived from a finite sample of images, a specific set of annotations, and a chain of evaluation choices including IoU thresholds, confidence cutoffs, and class definitions. The same model evaluated on a slightly different validation split would produce a different mAP. Sometimes the difference is small. Sometimes it changes the decision about which model to deploy.

Object detection evaluation has unique challenges that make uncertainty quantification even more critical than in classification or regression. This post explores why confidence intervals matter specifically for computer vision, and how Infer makes uncertainty-aware detection evaluation practical.

Why object detection metrics are especially uncertain

Object detection metrics like mAP are more complex and more fragile than classification accuracy. Understanding where uncertainty comes from helps explain why single-number evaluations are particularly misleading in computer vision.

Multiple objects per image compound variance

In classification, each image contributes one prediction. In detection, a single image might contain dozens of objects across multiple classes. This creates two sources of variability:

Prediction count variance: The number of detected objects varies by image, confidence threshold, and NMS settings
Per-object matching variance: Each ground truth box may or may not be matched, depending on IoU overlap and prediction ordering

A validation set of 1,000 images might contain 50,000 ground truth boxes. But those boxes are clustered within images, not independent samples. This clustering inflates variance compared to what naive sample size calculations would suggest.
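
To make the clustering effect concrete, here is a minimal simulation sketch. It assumes a hypothetical validation set in which every image has its own difficulty, so boxes in the same image share a detection probability. The Beta distribution and all variable names are illustrative assumptions, not measurements from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation set: 1,000 images with 50 ground truth boxes each.
n_images, boxes_per_image = 1000, 50
total_boxes = n_images * boxes_per_image

# Assumption: boxes within an image share that image's detection probability.
image_difficulty = rng.beta(4, 2, size=n_images)
hits_per_image = rng.binomial(boxes_per_image, image_difficulty)
recall = hits_per_image.sum() / total_boxes

# Naive standard error: treats all 50,000 boxes as independent samples.
naive_se = np.sqrt(recall * (1 - recall) / total_boxes)

# Image-level bootstrap: resample whole images so their boxes stay together.
idx = rng.integers(0, n_images, size=(2000, n_images))
boot_recall = hits_per_image[idx].sum(axis=1) / total_boxes
cluster_se = boot_recall.std(ddof=1)

print(f"recall={recall:.3f}  naive SE={naive_se:.4f}  image-level SE={cluster_se:.4f}")
```

Under these assumptions the image-level standard error comes out a few times larger than the naive one, which is the gap that a box-count-based sample size calculation would miss.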

IoU thresholds introduce discontinuities

Detection metrics depend on whether a predicted box overlaps sufficiently with ground truth. At IoU threshold 0.5, a prediction with 0.49 overlap counts as a false positive, while a nearly identical prediction with 0.51 overlap counts as a true positive.

This hard threshold creates a discontinuous relationship between box coordinates and metric values. Small annotation noise or model jitter can flip individual detections across the threshold boundary, causing metric instability that has nothing to do with actual model quality.

COCO-style mAP averages across 10 IoU thresholds from 0.5 to 0.95, which partially smooths this effect. But threshold sensitivity remains a fundamental source of evaluation uncertainty.
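
The threshold effect can be sketched with a plain IoU function. The boxes below are made-up examples in (x1, y1, x2, y2) pixel coordinates; the two predictions differ by a single pixel of horizontal shift.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 100, 100)
pred_shift_33 = (33, 0, 133, 100)  # shifted 33 px to the right
pred_shift_34 = (34, 0, 134, 100)  # shifted 34 px to the right

# The 10 COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
thresholds = np.linspace(0.5, 0.95, 10)

for name, pred in [("33 px", pred_shift_33), ("34 px", pred_shift_34)]:
    overlap = iou(ground_truth, pred)
    passed = int((overlap >= thresholds).sum())
    print(f"shift {name}: IoU={overlap:.4f}, passes {passed} of {len(thresholds)} thresholds")
```

The 33-pixel shift gives an IoU just above 0.5 and the 34-pixel shift gives one just below, so one pixel decides whether the detection counts at the 0.5 threshold.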

Class imbalance is often extreme

Object detection datasets frequently have severe class imbalance. COCO has 80 classes, but "person" appears in far more images than "toaster" or "hair dryer". Some classes may have only a few dozen instances in the entire validation set.

For rare classes, per-class AP is computed from a handful of predictions. The variance of these estimates is enormous, but this instability is hidden when metrics are averaged across classes.
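
A rough way to see the scale of the problem is to treat per-class recall as a simple proportion and apply the normal approximation. This is a simplification (AP is not a binomial proportion, and boxes are clustered within images), and the instance counts below are hypothetical, but the scaling with sample size is the point.

```python
import math

def recall_half_width(recall, n_instances, z=1.96):
    """Approximate 95% half-width of a recall estimate (normal approximation)."""
    return z * math.sqrt(recall * (1 - recall) / n_instances)

# Hypothetical instance counts for a rare and a common class, same recall.
for name, n_instances in [("rare class", 30), ("common class", 10_000)]:
    half_width = recall_half_width(0.70, n_instances)
    print(f"{name}: n={n_instances}, recall=0.70 +/- {half_width:.3f}")
```

With 30 instances the half-width is about 0.16, against about 0.01 with 10,000 instances, before accounting for clustering.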

Annotation quality varies

Ground truth annotations for detection are inherently noisy. Different annotators draw bounding boxes differently. Some objects are ambiguous or partially occluded. Annotation guidelines evolve over time.

This label noise propagates directly into metric calculations. A model that matches one annotator's style might show higher mAP than one that matches another's, even if both models have equivalent real-world performance.

What confidence intervals reveal about detection performance

Confidence intervals transform mAP and related metrics from false certainties into honest statements about what the evaluation data actually supports.

Distinguishing real improvements from noise

Consider two YOLO models evaluated on the same validation set:

Model A: mAP@0.5:0.95 = 0.491
Model B: mAP@0.5:0.95 = 0.502

Is Model B actually better? Without confidence intervals, the answer seems obvious. But with uncertainty quantification:

Model A: mAP = 0.491, 95% CI [0.439, 0.536]
Model B: mAP = 0.502, 95% CI [0.451, 0.548]

The confidence intervals overlap substantially. The observed difference of 0.011 is small next to interval widths of roughly 0.10, so it is well within the range of sampling variability. Selecting Model B on this evidence alone would risk selecting based on noise.

Understanding worst-case performance

A point estimate of mAP = 0.87 sounds production-ready. But the lower bound of a confidence interval tells a different story.

If the 95% CI is [0.82, 0.91], the evaluation data is consistent with a true mAP as low as 0.82 on data drawn from the same distribution as the validation set. Whether that plausible worst case is acceptable depends on the application. A self-driving car perception system has different requirements than a retail inventory scanner.

Confidence intervals make this risk assessment explicit rather than leaving it to implicit assumptions about metric stability.

Exposing fragile per-class performance

Aggregate mAP hides per-class variation. A model might achieve mAP = 0.75 overall, but with vastly different reliability across classes:

Class	AP	95% CI
person	0.89	[0.86, 0.92]
car	0.81	[0.74, 0.87]
bicycle	0.72	[0.58, 0.84]
stop sign	0.65	[0.41, 0.85]

The "stop sign" class has a CI width of 0.44, spanning from poor to excellent performance. This class has high uncertainty due to limited validation samples. For safety-critical applications, this instability must be addressed before deployment.

How Infer computes detection confidence intervals

Infer provides confidence intervals for object detection metrics through image-level bootstrap resampling. This approach is statistically principled and computationally efficient.

Bootstrap at the image level

Detection predictions within an image are not independent. A false positive on one object might cause a ground truth box to remain unmatched, affecting precision and recall calculations for other objects in the same image.

Infer respects this structure by resampling at the image level. Each bootstrap iteration:

Samples images with replacement from the validation set
Includes all predictions and ground truth boxes for each sampled image
Recomputes mAP, precision, and recall on the resampled data
Repeats thousands of times to build a distribution

This approach correctly captures the correlation structure within images while measuring variability across images.
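
The resampling loop described above can be sketched in a few lines of NumPy. This is a generic illustration, not Infer's actual API: the function name `image_level_bootstrap`, the (TP, FP, FN) count layout, and the simulated counts are all assumptions made for the example. It operates on cached per-image results, so no model inference happens inside the loop.

```python
import numpy as np

def image_level_bootstrap(per_image, metric_fn, n_resamples=1000,
                          confidence=0.95, seed=0):
    """Percentile CI for metric_fn, resampling whole images with replacement.

    per_image: array of shape (n_images, k) with cached per-image results.
    metric_fn: maps such an array to a single scalar metric.
    """
    rng = np.random.default_rng(seed)
    n_images = len(per_image)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n_images, size=n_images)  # sample images with replacement
        stats[b] = metric_fn(per_image[idx])            # an image's boxes travel together
    alpha = 1.0 - confidence
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric_fn(per_image), lower, upper

def precision(counts):
    """Pooled precision from per-image (TP, FP, FN) counts."""
    tp, fp = counts[:, 0].sum(), counts[:, 1].sum()
    return tp / (tp + fp)

# Hypothetical cached counts for 500 images: columns are (TP, FP, FN).
rng = np.random.default_rng(1)
counts = np.column_stack([rng.poisson(8, 500), rng.poisson(1, 500), rng.poisson(1, 500)])

point, lower, upper = image_level_bootstrap(counts, precision)
print(f"precision={point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Any metric that can be recomputed from per-image records, including mAP, could be passed as `metric_fn` in the same way.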

No re-inference required

A critical practical advantage: Infer reuses cached predictions. You run inference once on your validation set, then bootstrap resampling operates on those stored results.

This makes confidence interval computation fast. Computing 1,000 bootstrap iterations adds seconds to minutes, not hours of GPU time.

COCO-standard metric calculation

Infer implements COCO-standard AP calculation exactly:

IoU thresholds: 0.5:0.05:0.95 (10 thresholds)
Predictions sorted by confidence score
Greedy matching of predictions to ground truth
Area under precision-recall curve per class
Mean across classes and IoU thresholds

The point estimates match ultralytics and official COCO evaluation. The confidence intervals are computed around the same metric definitions practitioners already use.
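
For readers who want to see the steps listed above in code, here is a simplified single-class sketch with greedy matching and 101-point interpolated AP in the spirit of COCO evaluation. It is not Infer's or pycocotools' implementation: it ignores crowd annotations, object-size ranges, and the per-image detection cap, and the function names and toy data are assumptions made for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr=0.5):
    """preds: list of (image_id, score, box); gts: dict image_id -> list of boxes."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if not preds or n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    is_tp = []
    # Visit predictions from highest to lowest confidence.
    for img, _, box in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = iou_thr, -1
        for j, gt_box in enumerate(gts.get(img, [])):
            overlap = iou(box, gt_box)
            if not matched[img][j] and overlap >= best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0:
            matched[img][best_j] = True  # each ground truth box matches at most once
        is_tp.append(best_j >= 0)
    tp = np.array(is_tp, dtype=float)
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1 - tp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision non-increasing in recall, then sample at 101 recall points.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    grid = np.linspace(0, 1, 101)
    idx = np.searchsorted(recall, grid, side="left")
    sampled = np.where(idx < len(precision),
                       precision[np.minimum(idx, len(precision) - 1)], 0.0)
    return float(sampled.mean())

# Toy data: three ground truth boxes, two correct detections, one false positive.
gts = {0: [(0, 0, 10, 10), (20, 20, 30, 30)], 1: [(0, 0, 10, 10)]}
preds = [(0, 0.9, (0, 0, 10, 10)), (0, 0.8, (50, 50, 60, 60)), (1, 0.7, (0, 0, 10, 10))]

ap_50 = average_precision(preds, gts, iou_thr=0.5)
ap_coco = float(np.mean([average_precision(preds, gts, t)
                         for t in np.linspace(0.5, 0.95, 10)]))
print(f"AP@0.5={ap_50:.4f}  AP@0.5:0.95={ap_coco:.4f}")
```

Averaging this quantity over classes as well as thresholds gives the mAP-style number; in this toy case both correct detections overlap perfectly, so the value is the same at every threshold.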

Practical considerations

Choosing the number of resamples

More bootstrap iterations produce more stable CI estimates. Recommended values:

Quick exploration: 100-500 resamples
Standard evaluation: 1,000 resamples
Publication/deployment: 5,000-10,000 resamples

Computation time scales linearly with the number of resamples. For a 1,000-image validation set, 1,000 resamples typically complete in under a minute.
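
One way to see why more resamples help is to rerun the bootstrap with different random seeds and watch how much the CI endpoint moves. The sketch below does this for a simple mean over hypothetical per-image scores; the data and the helper name `lower_bound` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
per_image_score = rng.beta(5, 2, size=300)  # hypothetical per-image metric values

def lower_bound(values, n_resamples, seed):
    """Lower end of a 95% percentile bootstrap CI for the mean."""
    r = np.random.default_rng(seed)
    idx = r.integers(0, len(values), size=(n_resamples, len(values)))
    return np.quantile(values[idx].mean(axis=1), 0.025)

# Spread of the lower bound across 20 independent bootstrap runs.
spread = {}
for n_resamples in (100, 500, 2000):
    bounds = [lower_bound(per_image_score, n_resamples, seed) for seed in range(20)]
    spread[n_resamples] = float(np.std(bounds))
    print(f"{n_resamples:>5} resamples: std of lower bound = {spread[n_resamples]:.5f}")
```

The data never changes between runs, so any movement in the endpoint is pure Monte Carlo noise from the resampling itself, and it shrinks as the number of resamples grows.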

Bootstrap method selection

Infer supports three bootstrap methods:

bootstrap_percentile: Fast and robust, recommended for most use cases
bootstrap_bca: Bias-corrected and accelerated, more accurate but slower
bootstrap_basic: Simple baseline method

For detection metrics, bootstrap_percentile is usually sufficient. The BCa method (bootstrap_bca) provides marginal accuracy gains at 2-3x computational cost.
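
As a reference point, the textbook definitions of the percentile and basic intervals can be written from the same set of bootstrap replicates. This sketch uses those standard definitions on hypothetical per-image scores; whether Infer computes its intervals exactly this way is an assumption, and BCa is omitted because it additionally needs bias and acceleration estimates.

```python
import numpy as np

def bootstrap_intervals(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile and basic bootstrap CIs for the mean of `values`."""
    rng = np.random.default_rng(seed)
    theta_hat = values.mean()
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    boot = values[idx].mean(axis=1)
    alpha = 1.0 - confidence
    q_lo, q_hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return {
        # Percentile: read the interval directly off the bootstrap distribution.
        "percentile": (q_lo, q_hi),
        # Basic: reflect the bootstrap quantiles around the point estimate.
        "basic": (2 * theta_hat - q_hi, 2 * theta_hat - q_lo),
    }

scores = np.random.default_rng(1).beta(5, 2, size=400)  # hypothetical per-image scores
intervals = bootstrap_intervals(scores)
for method, (lo, hi) in intervals.items():
    print(f"{method:>10}: [{lo:.4f}, {hi:.4f}]")
```

The two intervals always have the same width; they differ only when the bootstrap distribution is skewed around the point estimate.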

Interpreting CI width

CI width indicates metric reliability:

Narrow CI (< 0.05): Metric is stable, safe to compare across models
Moderate CI (0.05-0.10): Normal variability, interpret differences carefully
Wide CI (> 0.10): High uncertainty, need more validation data or the metric itself is unstable

Wide CIs often indicate:

Small validation set
Rare classes dominating the metric
High variance in model predictions
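
The width bands above are easy to apply mechanically. The helper below is a hypothetical convenience function (not part of Infer), applied to the per-class intervals from the table earlier in this post.

```python
def describe_ci_width(lower, upper):
    """Return (width, label) using the rule-of-thumb bands from this post."""
    width = upper - lower
    if width < 0.05:
        return width, "narrow"
    if width <= 0.10:
        return width, "moderate"
    return width, "wide"

# Per-class 95% CIs from the per-class table above.
per_class_ci = {
    "person": (0.86, 0.92),
    "car": (0.74, 0.87),
    "bicycle": (0.58, 0.84),
    "stop sign": (0.41, 0.85),
}

for name, (lower, upper) in per_class_ci.items():
    width, label = describe_ci_width(lower, upper)
    print(f"{name:>9}: width={width:.2f} ({label})")
```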

Comparing models statistically

To determine if Model A is significantly better than Model B:

Compute CIs for both models on the same validation set
Check for CI overlap
If CIs do not overlap, the difference is statistically significant at the chosen confidence level

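Overlapping CIs do not settle the question. The overlap check is conservative, and because both models are scored on the same images, a paired comparison can detect differences that separate intervals miss. Consider:

Increasing the validation set size
Using paired bootstrap tests
Accepting that the current evaluation cannot distinguish the models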
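
A paired bootstrap can be sketched as follows: the same resampled image indices are applied to both models, and the interval is built on the difference. The function name, the per-image score layout, and the simulated data are assumptions for illustration, not Infer's API.

```python
import numpy as np

def paired_bootstrap_diff(per_image_a, per_image_b, metric_fn,
                          n_resamples=2000, confidence=0.95, seed=0):
    """Percentile CI for metric(B) - metric(A) using shared image resamples."""
    rng = np.random.default_rng(seed)
    n_images = len(per_image_a)
    diffs = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n_images, size=n_images)  # same images for both models
        diffs[b] = metric_fn(per_image_b[idx]) - metric_fn(per_image_a[idx])
    alpha = 1.0 - confidence
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return metric_fn(per_image_b) - metric_fn(per_image_a), lower, upper

# Hypothetical per-image scores: both models find the same images hard,
# and model B is better by about 0.01 on average.
rng = np.random.default_rng(0)
difficulty = rng.beta(5, 2, size=500)
score_a = np.clip(difficulty + rng.normal(0, 0.02, 500), 0, 1)
score_b = np.clip(difficulty + 0.01 + rng.normal(0, 0.02, 500), 0, 1)

diff, lower, upper = paired_bootstrap_diff(score_a, score_b, np.mean)
print(f"B - A = {diff:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```

In this simulated case the separate intervals for A and B would overlap heavily, because most of the variance comes from image difficulty shared by both models, yet the interval for the difference excludes zero. If the difference interval contains zero, the comparison remains inconclusive.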

Real-world impact

Case study: autonomous vehicle perception

An AV team evaluates two pedestrian detection models:

Model	mAP	95% CI
Current	0.847	[0.821, 0.871]
Candidate	0.859	[0.834, 0.882]

The candidate model shows higher point mAP, but the CIs overlap. The team investigates per-class performance:

Class	Current CI	Candidate CI
Adult pedestrian	[0.88, 0.92]	[0.89, 0.93]
Child pedestrian	[0.71, 0.84]	[0.68, 0.81]
Wheelchair user	[0.52, 0.78]	[0.61, 0.83]

The candidate model appears to improve wheelchair detection but slightly degrade child detection, although the intervals for both classes are wide and overlap. Aggregate mAP hid this possible tradeoff; the per-class CIs expose it. The team decides to collect more validation data for rare pedestrian types before making a deployment decision.

Case study: retail inventory

A retail company compares detection models for shelf monitoring:

Model	mAP	95% CI	Inference time
YOLOv8n	0.72	[0.68, 0.76]	5ms
YOLOv8s	0.78	[0.74, 0.82]	12ms
YOLOv8m	0.81	[0.77, 0.85]	28ms

The confidence intervals reveal:

YOLOv8n and YOLOv8s have CIs that overlap only marginally (0.74 to 0.76), and the point estimates differ by 0.06: the improvement is likely real
YOLOv8s and YOLOv8m have CIs that overlap substantially (0.77 to 0.82): the improvement may be noise

Given that YOLOv8m takes more than twice as long per inference as YOLOv8s (28ms versus 12ms), the team selects YOLOv8s. This evaluation does not establish that the larger model's accuracy advantage is real, so it cannot justify the latency cost.

Beyond point estimates

Object detection evaluation is not fundamentally about computing a number. It is about understanding whether a model meets requirements with sufficient reliability for deployment.

Single mAP values cannot answer questions like:

Is this accuracy improvement real or noise?
How might performance vary on different validation samples?
Which classes have reliable detection and which are fragile?
What is the worst-case performance we should plan for?

Confidence intervals provide the statistical foundation to answer these questions honestly.

Conclusion

Object detection metrics are estimates, not facts. They depend on finite validation data, noisy annotations, and arbitrary evaluation choices. Reporting mAP = 0.85 without uncertainty is claiming a precision that the evaluation process cannot support.

Confidence intervals transform detection evaluation from a ritual of computing numbers into a principled assessment of model reliability. They expose when improvements are real, when comparisons are inconclusive, and when per-class performance is too uncertain for safe deployment.

Infer makes this uncertainty quantification practical for computer vision workflows. It integrates directly with YOLO and standard detection pipelines, computes COCO-standard metrics with statistically appropriate confidence intervals, and visualizes the uncertainty that single numbers hide.

In domains where detection failures have consequences, from autonomous vehicles to medical imaging to security systems, treating evaluation uncertainty as optional is not defensible. Infer makes it measurable, visible, and actionable.

Install Infer: pip install infer-ci

GitHub: https://github.com/humblebeeai/infer-ci

Documentation: https://infer.humblebee.ai
