Module 06

ML API Development & Deployment

Master the art of serving, scaling, and operationalizing machine learning models with high-performance FastAPI.

01

What is an ML API?

An ML API exposes a trained machine learning model via an HTTP interface so other applications can use it for predictions.

1
📡 Input
Client sends input (e.g., text, image) → API.
2
⚙️ Processing
API processes input and prepares data for the model.
3
🧠 Prediction
Model generates a prediction/result.
4
📥 Response
API returns the result as a JSON response.
📡 Real-World Example
Request
json
POST /predict
{
  "text": "I love this product"
}
Response
json
{
  "sentiment": "positive"
}
02

ML Model Serving with FastAPI

Step 1: Save Model

save_model.py
python
import joblib

# After training
joblib.dump(model, "model.pkl")
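The snippet above assumes a trained `model` object already exists. As a dependency-free sketch of the save/load round trip, the example below uses the standard-library `pickle` module as a stand-in for `joblib` (both serialize a Python object to a file). The word weights and the `predict` helper are hypothetical placeholders, not part of the original notes.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical "trained" parameters: per-word sentiment weights.
trained_weights = {"love": 1.0, "great": 0.8, "terrible": -1.0, "slow": -0.5}


def predict(weights, texts):
    """Return 1 (positive) or 0 (negative) for each text."""
    labels = []
    for text in texts:
        score = sum(weights.get(word, 0.0) for word in text.lower().split())
        labels.append(1 if score > 0 else 0)
    return labels


# Save the artifact once, after training.
artifact = Path(tempfile.mkdtemp()) / "model.pkl"
with artifact.open("wb") as f:
    pickle.dump(trained_weights, f)

# Load it back, as the API process would at startup.
with artifact.open("rb") as f:
    restored_weights = pickle.load(f)

predictions = predict(restored_weights, ["I love this product", "Terrible and slow"])
```

Pickle-based artifacts (including those written by `joblib`) can execute code when loaded, so only load files from a trusted source.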

Step 2: Load and Predict

main.py
python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

class InputData(BaseModel):
    text: str

@app.post("/predict")
def predict(data: InputData):
    result = model.predict([data.text])
    # .tolist() converts NumPy values to plain Python types that JSON can encode
    return {"prediction": result.tolist()[0]}
03

ML Pipeline (VERY IMPORTANT)

A sequence of steps: Preprocessing → Model Prediction → Postprocessing.

Pipeline Logic
python
def preprocess(text):
    return text.lower()

def postprocess(pred):
    return "positive" if pred == 1 else "negative"

@app.post("/predict")
def predict(data: InputData):
    processed = preprocess(data.text)
    raw = model.predict([processed])
    final = postprocess(raw[0])
    return {"result": final}
Clean Architecture
Reusability
Easier Debugging
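One way to get the reusability and debugging benefits listed above is to keep the three stages in a single object that can be exercised without the web layer. This is a minimal sketch; `StubModel` and `SentimentPipeline` are hypothetical names, and the stub stands in for a real trained classifier.

```python
from dataclasses import dataclass
from typing import List


class StubModel:
    """Hypothetical stand-in for a trained classifier (1 = positive, 0 = negative)."""

    def predict(self, texts: List[str]) -> List[int]:
        return [1 if "love" in t else 0 for t in texts]


@dataclass
class SentimentPipeline:
    model: StubModel

    def preprocess(self, text: str) -> str:
        # Normalize whitespace and case before the model sees the text.
        return text.strip().lower()

    def postprocess(self, pred: int) -> str:
        # Map the raw label to a human-readable result.
        return "positive" if pred == 1 else "negative"

    def run(self, text: str) -> str:
        processed = self.preprocess(text)
        raw = self.model.predict([processed])
        return self.postprocess(raw[0])


pipeline = SentimentPipeline(model=StubModel())
result = pipeline.run("  I LOVE this product ")
```

An endpoint then only needs to call `pipeline.run(data.text)`, and each stage can be unit-tested on its own.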
04

Handling Large Models

The Problem & Solution

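Reloading a large model on every request adds its full load time to each response. Load it once at startup with a lifespan handler and reuse it across requests.

Events.py
python
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = joblib.load("model.pkl")  # runs once, before the first request
    yield  # the app serves requests here; cleanup code can follow

app = FastAPI(lifespan=lifespan)  # replaces the plain FastAPI() call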
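A related way to guarantee a single load is lazy loading behind a cache: the first caller pays the load cost and every later caller reuses the same object. The sketch below uses `functools.lru_cache`; `get_model` and the `stats` counter are hypothetical, and the returned dictionary stands in for the real `joblib.load("model.pkl")` result.

```python
from functools import lru_cache

stats = {"loads": 0}  # only here to show how many times the loader runs


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use; later calls return the cached object."""
    stats["loads"] += 1
    # In a real service this line would be joblib.load("model.pkl").
    return {"name": "stub-model"}


# Simulate three requests asking for the model.
models = [get_model() for _ in range(3)]
```

The trade-off is that the first request is slow, whereas loading at startup moves that cost before traffic arrives.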
05

Async + ML (Important Insight)

The Sync/Async Rule

Most ML libraries are CPU-bound, not async-friendly.

Use Async For
I/O operations (DB, APIs, Files)
Use Sync For
ML Computations & Model Inference
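FastAPI runs plain `def` endpoints in a thread pool, which is why synchronous inference does not block the event loop. If an endpoint must be `async def` (for example, because it also awaits a database call), the blocking prediction can be pushed to a worker thread explicitly. The sketch below shows that pattern with `asyncio.to_thread`; `cpu_bound_predict` and `handle_request` are hypothetical names, and `time.sleep` stands in for model computation.

```python
import asyncio
import time


def cpu_bound_predict(text: str) -> str:
    """Hypothetical blocking inference call."""
    time.sleep(0.05)  # stands in for model computation
    return "positive" if "love" in text.lower() else "negative"


async def handle_request(text: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(cpu_bound_predict, text)


async def main():
    # Two requests are served concurrently; results keep the call order.
    return await asyncio.gather(
        handle_request("I love this product"),
        handle_request("Not for me"),
    )


results = asyncio.run(main())
```

Threads help only when the underlying library releases the GIL during computation; pure-Python inference would need separate processes to run in parallel.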
06

Batch Predictions (Optimization)

Instead of calling the model once per input, pass several inputs in a single call. Vectorized libraries spread the per-call overhead across the whole batch, which usually improves throughput.

Batch API
python
from typing import List

class BatchInput(BaseModel):
    texts: List[str]

@app.post("/batch-predict")
def batch_predict(data: BatchInput):
    results = model.predict(data.texts)
    return {"results": results.tolist()}
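An unbounded batch can exhaust memory, so a common companion to a batch endpoint is a helper that splits large requests into fixed-size chunks. The sketch below compares the number of model calls for one-by-one prediction and chunked prediction; `CountingModel` and `predict_in_chunks` are hypothetical names used only for illustration.

```python
from typing import List


class CountingModel:
    """Hypothetical model that records how many times predict() is called."""

    def __init__(self):
        self.calls = 0

    def predict(self, texts: List[str]) -> List[int]:
        self.calls += 1
        return [len(t) % 2 for t in texts]  # dummy label per input


def predict_in_chunks(model, texts: List[str], max_batch: int = 2) -> List[int]:
    """Split a large request into bounded batches and concatenate the results."""
    results: List[int] = []
    for start in range(0, len(texts), max_batch):
        results.extend(model.predict(texts[start:start + max_batch]))
    return results


texts = ["a", "bb", "ccc", "dddd", "eeeee"]

one_by_one = CountingModel()
single_results = [one_by_one.predict([t])[0] for t in texts]

batched = CountingModel()
batch_results = predict_in_chunks(batched, texts, max_batch=2)
```

Both paths return the same labels, but the chunked path makes three model calls instead of five.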
07

Logging Predictions (DB Integration)

Persistence Logic
python
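from fastapi import Depends

# get_db (session dependency) and Prediction (ORM model) are assumed to be defined elsewhere
@app.post("/predict")
def predict(data: InputData, db=Depends(get_db)):
    # .tolist() yields a plain Python value that JSON can encode
    result = model.predict([data.text]).tolist()[0]

    # Store the prediction as text so any label type fits the column
    db_entry = Prediction(input_data=data.text, result=str(result))
    db.add(db_entry)
    db.commit()

    return {"result": result}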
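The endpoint above relies on a `Prediction` table and a `get_db` session that are not shown. As a self-contained sketch of the same persistence step, the example below uses the standard-library `sqlite3` module with an in-memory database; the table layout and the `log_prediction` helper are assumptions chosen to mirror the `input_data` and `result` fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path or a real database in production
conn.execute(
    "CREATE TABLE predictions ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "input_data TEXT NOT NULL, "
    "result TEXT NOT NULL, "
    "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)


def log_prediction(db: sqlite3.Connection, input_data: str, result: str) -> int:
    """Insert one prediction row and return its id."""
    cur = db.execute(
        "INSERT INTO predictions (input_data, result) VALUES (?, ?)",
        (input_data, result),  # parameter binding avoids SQL injection
    )
    db.commit()
    return cur.lastrowid


row_id = log_prediction(conn, "I love this product", "positive")
rows = conn.execute("SELECT input_data, result FROM predictions").fetchall()
```

Logged inputs and outputs become the raw material for later monitoring and retraining.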
08

Versioning Models

Model Files

Track improvements and roll back easily.

model_v1.pkl, model_v2.pkl
API Versioning
Path-Based Versioning
python
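# Each route serves its own model file, so clients can pin a version
model_v1 = joblib.load("model_v1.pkl")
model_v2 = joblib.load("model_v2.pkl")

@app.post("/v1/predict")
def predict_v1(data: InputData):
    return {"prediction": model_v1.predict([data.text]).tolist()[0]}

@app.post("/v2/predict")
def predict_v2(data: InputData):
    return {"prediction": model_v2.predict([data.text]).tolist()[0]}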
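Rollback becomes a one-line change when versions are kept in a registry and a default version is resolved at call time. The sketch below illustrates the idea with plain functions standing in for loaded model files; `MODELS`, `DEFAULT_VERSION`, and both toy models are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def model_v1(texts: List[str]) -> List[str]:
    return ["positive" if "love" in t.lower() else "negative" for t in texts]


def model_v2(texts: List[str]) -> List[str]:
    # The "improved" version recognizes one more positive word.
    keywords = ("love", "great")
    return ["positive" if any(k in t.lower() for k in keywords) else "negative" for t in texts]


MODELS: Dict[str, Callable[[List[str]], List[str]]] = {"v1": model_v1, "v2": model_v2}
DEFAULT_VERSION = "v2"  # set back to "v1" to roll back


def predict(texts: List[str], version: Optional[str] = None) -> List[str]:
    """Dispatch to the requested model version, or to the current default."""
    chosen = version or DEFAULT_VERSION
    if chosen not in MODELS:
        raise KeyError(f"unknown model version: {chosen}")
    return MODELS[chosen](texts)


old = predict(["great phone"], version="v1")
new = predict(["great phone"])
```

Running both versions on the same input also gives a simple way to compare their behavior before switching the default.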
09

Dockerization (VERY IMPORTANT)

Containerize your application + dependencies for consistent deployment across environments.

🐳 Dockerfile
Dockerfile
dockerfile
FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
📦 Build & Run Commands
Build
bash
docker build -t ml-api .
Run
bash
docker run -p 80:80 ml-api
10

Deployment & Scaling

Deployment Platforms

AWS (EC2, ECS), Google Cloud (GKE), Azure ML.

Horizontal Scaling

Add multiple instances via Load Balancer.

Tools of Choice

Kubernetes, Docker Swarm, ECS.

The scaling challenge

Each replica (and each worker process inside it) loads its own copy of the model, so memory usage grows linearly with the number of instances.

Solution: Shared Model Servers or intelligent Model Caching.
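A back-of-the-envelope estimate shows why duplication matters. The sketch assumes every worker process holds a full private copy of the model (no memory sharing between processes); the 2 GB model size and 4 workers per replica are illustrative numbers, not measurements.

```python
def total_model_memory_gb(model_size_gb: float, replicas: int, workers_per_replica: int) -> float:
    """Memory spent on model copies across the whole deployment."""
    return model_size_gb * replicas * workers_per_replica


# Illustrative numbers: a 2 GB model, 4 worker processes per replica.
estimates = {replicas: total_model_memory_gb(2.0, replicas, 4) for replicas in (1, 2, 4)}
```

Doubling the replica count doubles the memory bill, which is the cost a shared model server is meant to avoid.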

11

Performance & Security

Performance Optimization
  • Batch Requests: Optimize CPU/GPU usage.
  • Redis Caching: Cache common predictions.
  • Reduce Size: Quantization/Pruning.
  • Use GPU: For deep learning models.
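The caching idea can be sketched without Redis: an in-process LRU cache keyed by a hash of the normalized input behaves the same way from the endpoint's point of view. `PredictionCache`, `slow_predict`, and `cached_predict` are hypothetical names; a Redis-backed version would replace the `OrderedDict` with calls to a Redis client.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class PredictionCache:
    """Small LRU cache; stands in for an external store such as Redis."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def make_key(text: str) -> str:
        # Normalize first so trivially different inputs share one entry.
        return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    def get(self, text: str) -> Optional[str]:
        key = self.make_key(text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, text: str, result: str) -> None:
        key = self.make_key(text)
        self._store[key] = result
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry


model_calls = {"count": 0}


def slow_predict(text: str) -> str:
    model_calls["count"] += 1
    return "positive" if "love" in text.lower() else "negative"


cache = PredictionCache(max_entries=2)


def cached_predict(text: str) -> str:
    hit = cache.get(text)
    if hit is not None:
        return hit
    result = slow_predict(text)
    cache.put(text, result)
    return result


answers = [cached_predict(t) for t in ["I love it", "i love it ", "Not good"]]
```

The second input differs only in case and whitespace, so it is served from the cache and the model runs twice for three requests.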
Security & Validation
  • Protect Endpoints: JWT authentication.
  • Input Validation: Strict request schemas.
  • Rate Limiting: Prevent abuse/DDoS.
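Rate limiting is often implemented as a token bucket: each client may burst up to a fixed capacity, and tokens refill at a steady rate. The sketch below is a single-process illustration with an injectable clock so the behavior can be checked deterministically; `TokenBucket` is a hypothetical class, and a multi-instance deployment would need shared state (for example in Redis) rather than in-memory counters.

```python
import time


class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


fake_now = [0.0]  # controllable clock for the demonstration
bucket = TokenBucket(capacity=2, refill_per_second=1.0, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]  # third call exceeds the burst size
fake_now[0] = 1.0                           # one second later, one token is back
after_wait = bucket.allow()
```

A rejected request would typically be answered with HTTP status 429 (Too Many Requests).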
12

Monitoring & CI/CD

📊

Track Everything

Monitoring prediction distribution and request latency is key to identifying model drift and API saturation early.

Prometheus, Grafana, MLflow, Model Drift, Latency Distribution
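Both signals can be computed from logged requests with a few lines of code. The sketch below derives latency percentiles with the nearest-rank method and flags a shift in the share of positive predictions; the sample numbers and the 0.2 threshold are made up for illustration, and a production setup would export these values to a tool such as Prometheus instead of computing them inline.

```python
import math
from collections import Counter
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def positive_share(predictions: List[str]) -> float:
    counts = Counter(predictions)
    return counts["positive"] / len(predictions)


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 200, 12, 16, 13, 14]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Made-up prediction logs: a reference window and the most recent window.
baseline = ["positive"] * 6 + ["negative"] * 4
current = ["positive"] * 2 + ["negative"] * 8

shift = abs(positive_share(current) - positive_share(baseline))
drift_suspected = shift > 0.2  # illustrative threshold
```

The median hides the slow outlier while the 95th percentile exposes it, which is why latency is usually tracked as a distribution rather than an average.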
🔄 CI/CD Flow for ML

Train Model → Save Artifact → Build Docker Image → Deploy.

13

Interview Preparation

🧠 Theory Questions

What is model serving?
Exposing an ML model via an HTTP interface (FastAPI) so other services can consume its predictions.
ML Pipeline steps?
Preprocessing (data cleaning) → Model Prediction (inference) → Postprocessing (formatting result).
Sync vs Async in ML?
CPU-bound tasks (inference) are typically sync; I/O-bound tasks (database queries, external API calls) should be async.
How to scale ML APIs?
Use Load Balancers with Horizontal Scaling (multiple replicas) and shared model caching/servers.
Model versioning importance?
Tracks performance improvements and allows for rapid rollback if a new model fails.
Challenges in ML APIs?
High latency, significant memory usage, version drift, and model-API synchronization.
How do you deploy an ML Model?
Typically using FastAPI + Docker + Cloud Platforms (AWS/GCP) for portability and scaling.

💻 Coding Challenges

Q1: Basic Predict API
Challenge Snippet
python
@app.post("/predict")
def predict(data: InputData):
    # .tolist() converts NumPy values to JSON-friendly Python types
    return {"res": model.predict([data.text]).tolist()[0]}
Q2: Batch Predict
Challenge Snippet
python
@app.post("/batch")
def batch_predict(data: BatchInput):
    return model.predict(data.texts).tolist()
Q3: Lifecycle Load
Challenge Snippet
python
# Note: on_event is deprecated in recent FastAPI releases in favor of a lifespan handler
@app.on_event("startup")
def load():
    global model
    model = joblib.load("model.pkl")