Why Computer Vision Pipelines Are Hard in Production
Getting a YOLO model to detect objects on your laptop is easy. Getting it to reliably process a live video stream, handle edge cases, and serve predictions over HTTP at low latency — that's a different problem entirely.
In this article I'll walk through the architecture I've used in production: an OpenCV frame capture loop feeding into a YOLOv8 inference engine, wrapped in a FastAPI service that returns structured JSON predictions.
Prerequisites
pip install ultralytics opencv-python fastapi uvicorn python-multipart
We're using YOLOv8 via the ultralytics package, which handles model weights, inference, and post-processing in one clean API.
Step 1 — Load the Model Once at Startup
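The biggest mistake I see in CV services is loading the model on every request. YOLOv8 weights range from a few MB for the nano model to over 100 MB for the largest variants, and loading them adds noticeable latency to every call. Load them once at startup and keep them in memory.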
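If several requests can arrive before the first load finishes, a lock keeps the load from running twice. Here is a minimal, stdlib-only sketch of that load-once idea. `LazyModel` and `fake_loader` are hypothetical names for illustration; in the real service the loader would be the `YOLO(...)` call.

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Load an expensive object once, even if several threads ask at the same time."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> object:
        # Fast path: once loaded, no locking is needed.
        if self._model is None:
            with self._lock:
                # Re-check inside the lock: another thread may have loaded it meanwhile.
                if self._model is None:
                    self._model = self._loader()
        return self._model


load_calls = []


def fake_loader() -> object:
    # Hypothetical stand-in for the real model load; records each call.
    load_calls.append(1)
    return object()


lazy = LazyModel(fake_loader)
threads = [threading.Thread(target=lazy.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight threads call `get()`, but the loader runs only once and every caller receives the same object.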
# model.py
from ultralytics import YOLO

_model = None

def get_model() -> YOLO:
    global _model
    if _model is None:
        _model = YOLO("yolov8n.pt")  # nano = fastest, use yolov8x for accuracy
    return _model
Step 2 — The Inference Function
Accept a raw image as bytes, decode with OpenCV, run inference, and return structured results.
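Uploads are not always images, so it can help to reject obvious junk before handing bytes to OpenCV. The sketch below checks the leading "magic" bytes for JPEG and PNG only. `sniff_image_type` is a hypothetical helper, not part of OpenCV or ultralytics.

```python
from typing import Optional

# Leading "magic" bytes for the two formats this sketch recognises.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return "jpeg" or "png" if the payload starts like one, else None."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None
```

Passing this check does not guarantee the image will decode, and OpenCV reads more formats than these two, so treat it as a cheap first filter rather than validation.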
# inference.py
import cv2
import numpy as np
from model import get_model

def run_detection(image_bytes: bytes, confidence: float = 0.5) -> list[dict]:
    if not image_bytes:
        raise ValueError("empty image payload")
    # Decode the raw bytes into a BGR image; imdecode returns None on failure
    nparr = np.frombuffer(image_bytes, np.uint8)
    frame = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if frame is None:
        raise ValueError("could not decode image bytes")
    model = get_model()
    # One image in, so take the first (and only) result
    results = model(frame, conf=confidence)[0]
    detections = []
    for box in results.boxes:
        detections.append({
            "label": results.names[int(box.cls)],
            "confidence": round(float(box.conf), 3),
            "bbox": {
                "x1": int(box.xyxy[0][0]),
                "y1": int(box.xyxy[0][1]),
                "x2": int(box.xyxy[0][2]),
                "y2": int(box.xyxy[0][3]),
            },
        })
    return detections
Step 3 — FastAPI REST Endpoint
Expose the inference function over HTTP. Accept multipart/form-data for image uploads.
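To see what a multipart/form-data upload looks like on the wire, here is a stdlib-only sketch that builds a single-file request body. `build_multipart` is a hypothetical helper, and the field name `file` is an assumption chosen to match the upload parameter of the endpoint in this step. In practice you would more likely let a client library such as requests or httpx build this for you.

```python
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "image/jpeg") -> Tuple[bytes, str]:
    """Return (body, Content-Type header value) for a single-file upload."""
    boundary = uuid.uuid4().hex
    # Part headers, then a blank line, then the raw file bytes.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("ascii")
    # The closing boundary carries two trailing dashes.
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


body, header = build_multipart("file", "frame.jpg", b"\xff\xd8\xff fake jpeg bytes")
```

The body can then be sent with `urllib.request.Request(url, data=body, headers={"Content-Type": header}, method="POST")`, with the confidence threshold passed as a query parameter.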
# main.py
from fastapi import FastAPI, File, HTTPException, Query, UploadFile
from inference import run_detection

app = FastAPI(title="Computer Vision API")

@app.post("/detect")
async def detect_objects(
    file: UploadFile = File(...),
    confidence: float = Query(default=0.5, ge=0.1, le=1.0),
):
    image_bytes = await file.read()
    try:
        # Note: this call blocks the event loop, so each worker handles one inference at a time
        detections = run_detection(image_bytes, confidence)
    except ValueError as exc:
        # Report bad or undecodable uploads as a client error instead of a 500
        raise HTTPException(status_code=400, detail=str(exc))
    return {"count": len(detections), "detections": detections}

@app.get("/health")
def health():
    return {"status": "ok"}
Run it locally:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2
Step 4 — Real-Time Video Stream Processing
For live CCTV or webcam feeds, use a separate background worker instead of blocking the HTTP thread.
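The worker thread writes results while HTTP handlers read them, so the handoff between the two is worth making explicit. Below is a minimal sketch of a thread-safe holder that also tracks how old the latest result is. `LatestResult` is a hypothetical name, and the detection dict is made-up sample data.

```python
import threading
import time
from typing import Any, List, Optional, Tuple


class LatestResult:
    """Thread-safe holder for the most recent detections and their age."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._detections: List[Any] = []
        self._updated_at: Optional[float] = None

    def publish(self, detections: List[Any]) -> None:
        # Called by the worker thread after each processed frame.
        with self._lock:
            self._detections = list(detections)
            self._updated_at = time.monotonic()

    def snapshot(self) -> Tuple[List[Any], Optional[float]]:
        """Return (copy of detections, age in seconds); age is None before the first publish."""
        with self._lock:
            if self._updated_at is None:
                return [], None
            return list(self._detections), time.monotonic() - self._updated_at


latest = LatestResult()
latest.publish([{"label": "person", "confidence": 0.91}])
detections, age = latest.snapshot()
```

Returning the age lets a handler tell clients when the stream has stalled instead of silently serving stale detections.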
# stream_worker.py
import threading

import cv2
from inference import run_detection

class StreamWorker(threading.Thread):
    def __init__(self, source: str | int = 0):
        super().__init__(daemon=True)
        self.source = source
        self.latest_detections = []

    def run(self):
        cap = cv2.VideoCapture(self.source)
        try:
            while cap.isOpened():
                ret, frame = cap.read()
                if not ret:
                    break  # stream ended or the camera dropped
                # Re-encode to JPEG so the frame fits run_detection's bytes-in interface
                ok, encoded = cv2.imencode(".jpg", frame)
                if not ok:
                    continue  # skip frames that fail to encode
                self.latest_detections = run_detection(encoded.tobytes())
        finally:
            cap.release()  # always free the capture device
Step 5 — Dockerise for Deployment
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y libgl1 libglib2.0-0 && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN python -c "from ultralytics import YOLO; YOLO('yolov8n.pt')"
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Production Tips
- Use GPU instances (for example, AWS g4dn.xlarge) for real-time video; CPU inference is typically too slow above about 5 FPS
- Cache the model on EFS when running multiple ECS tasks to avoid re-downloading weights on each start
- Set the confidence threshold via an environment variable rather than hardcoding it, so you can tune it per deployment
- Add request queuing (Redis + Celery) for batch image-processing workloads
- Log bounding box counts per class to CloudWatch for drift detection
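Two of these tips are small enough to sketch with the standard library: reading the threshold from the environment, and counting detections per class before logging them. The variable name `CONFIDENCE_THRESHOLD` and both function names are assumptions for illustration, and the clamp range mirrors the 0.1 to 1.0 bounds on the endpoint's `confidence` parameter.

```python
import math
import os
from collections import Counter
from typing import Dict, List, Mapping


def confidence_from_env(env: Mapping[str, str] = os.environ,
                        default: float = 0.5) -> float:
    """Read CONFIDENCE_THRESHOLD (hypothetical name), clamped to [0.1, 1.0]."""
    raw = env.get("CONFIDENCE_THRESHOLD")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # unparsable value: fall back rather than crash at startup
    if not math.isfinite(value):
        return default  # reject nan and inf
    return min(max(value, 0.1), 1.0)


def counts_per_class(detections: List[dict]) -> Dict[str, int]:
    """Count detections by label, ready to emit as per-class metrics."""
    return dict(Counter(d["label"] for d in detections))
```

A sustained shift in these per-class counts for the same camera is the kind of signal the drift-detection tip is after. Shipping the numbers to CloudWatch is left out here.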
Wrapping Up
This pattern — model singleton, bytes-in/JSON-out inference function, FastAPI wrapper, Docker container — scales from a weekend project to a production service handling thousands of images per minute.
In a follow-up post I'll cover deploying this exact service on AWS ECS with auto-scaling based on SQS queue depth.