Building a Real-Time Object Detection Pipeline with Python and OpenCV

Why Computer Vision Pipelines Are Hard in Production

Getting a YOLO model to detect objects on your laptop is easy. Getting it to reliably process a live video stream, handle edge cases, and serve predictions over HTTP at low latency — that's a different problem entirely.

In this article I'll walk through the architecture I've used in production: an OpenCV frame capture loop feeding into a YOLOv8 inference engine, wrapped in a FastAPI service that returns structured JSON predictions.

Prerequisites

pip install ultralytics opencv-python fastapi uvicorn python-multipart

We're using YOLOv8 via the ultralytics package — it handles model weights, inference, and post-processing in one clean API.

Step 1 — Load the Model Once at Startup

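The biggest mistake I see in CV services is loading the model on every request. YOLOv8 weights range from a few MB for the nano model to over a hundred MB for the largest variant, and reading them from disk on every call adds avoidable latency. Load them once per process and keep them in memory.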

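# model.py
import threading

from ultralytics import YOLO

_model = None
_lock = threading.Lock()

def get_model() -> YOLO:
    """Return the shared model, loading it on first use."""
    global _model
    with _lock:  # stop two threads from loading the weights at the same time
        if _model is None:
            _model = YOLO("yolov8n.pt")  # nano = fastest, use yolov8x for accuracy
    return _model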
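
The same load-once idea can be sketched with the standard library's functools.lru_cache. This is a minimal illustration, not the service code: load_weights is a hypothetical stand-in for YOLO("yolov8n.pt"), and the counter exists only to show that the loader runs a single time.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive loader actually runs


def load_weights(path: str) -> dict:
    """Hypothetical stand-in for YOLO(path); imagine this call is slow."""
    global load_calls
    load_calls += 1
    return {"weights": path}


@lru_cache(maxsize=1)
def get_model() -> dict:
    # The body runs on the first call only; later calls return the cached object.
    return load_weights("yolov8n.pt")


first = get_model()
second = get_model()  # served from the cache, no second load
```

Note that lru_cache does not prevent two threads from both running the loader if they race on the very first call, so calling get_model() once during startup is the safer way to warm it.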

Step 2 — The Inference Function

Accept a raw image as bytes, decode with OpenCV, run inference, and return structured results.

# inference.py
import cv2
import numpy as np
from model import get_model

def run_detection(image_bytes: bytes, confidence: float = 0.5) -> list[dict]:
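    if not image_bytes:
        raise ValueError("empty image payload")

    # Decode the raw bytes into a BGR array; imdecode returns None on failure
    nparr = np.frombuffer(image_bytes, np.uint8)
    frame = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if frame is None:
        raise ValueError("could not decode image bytes")

    model = get_model()
    results = model(frame, conf=confidence)[0]  # one image in, one Results object out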

    detections = []
    for box in results.boxes:
        detections.append({
            "label": results.names[int(box.cls)],
            "confidence": round(float(box.conf), 3),
            "bbox": {
                "x1": int(box.xyxy[0][0]),
                "y1": int(box.xyxy[0][1]),
                "x2": int(box.xyxy[0][2]),
                "y2": int(box.xyxy[0][3]),
            },
        })
    return detections
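
To make the shape of one returned item explicit, here is a small sketch that describes it with TypedDict and builds an example from plain numbers. The names BBox, Detection and make_detection are my own for illustration and are not part of ultralytics; the rounding and int casts mirror what run_detection does.

```python
from typing import TypedDict


class BBox(TypedDict):
    x1: int
    y1: int
    x2: int
    y2: int


class Detection(TypedDict):
    label: str
    confidence: float
    bbox: BBox


def make_detection(label: str, conf: float,
                   xyxy: tuple[float, float, float, float]) -> Detection:
    """Build one detection dict with the same rounding and int casts as run_detection."""
    x1, y1, x2, y2 = (int(v) for v in xyxy)  # int() truncates towards zero
    return {
        "label": label,
        "confidence": round(conf, 3),
        "bbox": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
    }


example = make_detection("person", 0.87654, (12.7, 40.2, 210.9, 380.0))
```

Coordinates are in pixels of the decoded image, with (x1, y1) the top-left corner and (x2, y2) the bottom-right.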

Step 3 — FastAPI REST Endpoint

Expose the inference function over HTTP. Accept multipart/form-data for image uploads.

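# main.py
from fastapi import FastAPI, File, HTTPException, Query, UploadFile
from fastapi.concurrency import run_in_threadpool
from inference import run_detection

app = FastAPI(title="Computer Vision API")

@app.post("/detect")
async def detect_objects(
    file: UploadFile = File(...),
    confidence: float = Query(default=0.5, ge=0.1, le=1.0),
):
    image_bytes = await file.read()
    try:
        # Inference is compute-bound, so run it in a worker thread
        # instead of blocking the event loop
        detections = await run_in_threadpool(run_detection, image_bytes, confidence)
    except ValueError as exc:
        # Reject bad uploads with a 400 rather than a 500
        raise HTTPException(status_code=400, detail=str(exc))
    return {"count": len(detections), "detections": detections}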

@app.get("/health")
def health():
    return {"status": "ok"}

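Run it locally. Each uvicorn worker is a separate process that loads its own copy of the model, so match --workers to the memory you have available: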

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2
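
To exercise the endpoint without extra dependencies, a client can be sketched with the standard library alone. This is an illustrative sketch: build_multipart and detect are hypothetical helper names, and the base URL assumes the service is running locally on port 8000 as started above. The form field is called "file" to match the endpoint's parameter name.

```python
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "image/jpeg") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def detect(image_path: str, base_url: str = "http://localhost:8000",
           confidence: float = 0.5) -> dict:
    """POST an image to /detect and return the parsed JSON response."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart("file", "frame.jpg", fh.read())
    req = urllib.request.Request(
        f"{base_url}/detect?confidence={confidence}",  # threshold goes in the query string
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With the server running, detect("street.jpg", confidence=0.4) should return a dict with the "count" and "detections" keys, where street.jpg is whatever test image you have on hand.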

Step 4 — Real-Time Video Stream Processing

For live CCTV or webcam feeds, run the capture loop in a separate background thread instead of blocking request handlers.

# stream_worker.py
import threading

import cv2
from inference import run_detection

class StreamWorker(threading.Thread):
    def __init__(self, source: str | int = 0):
        super().__init__(daemon=True)
        self.source = source
        self.latest_detections = []  # replaced wholesale after every frame

    def run(self):
        cap = cv2.VideoCapture(self.source)
        try:
            while cap.isOpened():
                ret, frame = cap.read()
                if not ret:
                    break  # stream ended or the read failed
                # run_detection expects encoded bytes, so re-encode the frame
                ok, encoded = cv2.imencode(".jpg", frame)
                if not ok:
                    continue
                self.latest_detections = run_detection(encoded.tobytes())
        finally:
            cap.release()  # free the capture device even if inference raises
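
If HTTP handlers read the worker's results while it is running, it helps to make the hand-off explicit and to give the loop a way to stop. The sketch below shows one way to do that with a lock and an Event, using only the standard library. LatestResultWorker is a hypothetical name, and the frame list and lambda detector are stand-ins for the OpenCV capture and run_detection.

```python
import threading
from typing import Callable, Iterable


class LatestResultWorker(threading.Thread):
    """Runs `detect` over `frames` and keeps only the newest result."""

    def __init__(self, frames: Iterable[bytes], detect: Callable[[bytes], list]):
        super().__init__(daemon=True)
        self._frames = frames
        self._detect = detect
        self._result_lock = threading.Lock()
        self._stop_event = threading.Event()
        self._latest: list = []

    def run(self) -> None:
        for frame in self._frames:
            if self._stop_event.is_set():
                break
            result = self._detect(frame)  # the slow call happens outside the lock
            with self._result_lock:
                self._latest = result

    def stop(self) -> None:
        """Ask the loop to exit before it processes the next frame."""
        self._stop_event.set()

    def latest(self) -> list:
        with self._result_lock:
            return list(self._latest)  # copy so callers cannot mutate shared state


# Fake frames and a fake detector, purely for illustration
worker = LatestResultWorker([b"f1", b"f2", b"f3"], lambda f: [{"frame": f.decode()}])
worker.start()
worker.join(timeout=5)
final = worker.latest()
```

An endpoint can then return worker.latest() without ever waiting on inference.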

Step 5 — Dockerise for Deployment

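The Dockerfile below assumes a requirements.txt listing the packages from the Prerequisites section. It downloads the model weights at build time so containers start without a network fetch.

FROM python:3.11-slim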
WORKDIR /app
RUN apt-get update && apt-get install -y libgl1 libglib2.0-0 && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -c "from ultralytics import YOLO; YOLO('yolov8n.pt')"
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Production Tips

▸Use GPU instances (e.g. AWS g4dn.xlarge) for real-time video; depending on model size and resolution, CPU inference often cannot sustain much more than about 5 FPS
▸Cache model on EFS when running multiple ECS tasks to avoid re-downloading weights on each start
▸Set confidence threshold via environment variable — not hardcoded — so you can tune per deployment
▸Add request queuing (Redis + Celery) for batch image processing workloads
▸Log bounding box counts per class to CloudWatch for drift detection
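
Two of these tips are easy to sketch in a few lines: reading the confidence threshold from the environment and counting detections per class before logging them. The variable name DETECTION_CONFIDENCE and both helper names are assumptions made for this example; the counts would be handed to whatever metrics or logging client you already use.

```python
import os
from collections import Counter


def confidence_from_env(var: str = "DETECTION_CONFIDENCE",
                        default: float = 0.5) -> float:
    """Read the threshold from the environment, falling back to `default`."""
    raw = os.environ.get(var)
    if raw is None:
        return default
    value = float(raw)  # raises ValueError on malformed input, so bad config fails fast
    if not 0.0 < value <= 1.0:
        raise ValueError(f"{var} must be in (0, 1], got {value}")
    return value


def count_by_label(detections: list[dict]) -> dict[str, int]:
    """Count detections per class label, ready to emit as a metric or log line."""
    return dict(Counter(d["label"] for d in detections))


sample = [{"label": "person"}, {"label": "car"}, {"label": "person"}]
counts = count_by_label(sample)
```

Tracking these per-class counts over time gives a cheap signal for drift: a class that suddenly disappears or doubles is worth a look.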

Wrapping Up

This pattern — model singleton, bytes-in/JSON-out inference function, FastAPI wrapper, Docker container — scales from a weekend project to a production service handling thousands of images per minute.

In a follow-up post I'll cover deploying this exact service on AWS ECS with auto-scaling based on SQS queue depth.