Module 06

ML API Development & Deployment

Master the art of serving, scaling, and operationalizing machine learning models with high-performance FastAPI.

What is an ML API?

An ML API exposes a trained machine learning model via an HTTP interface so other applications can use it for predictions.

📡 Input

Client sends input (e.g., text, image) → API.

⚙️ Processing

API processes input and prepares data for the model.

🧠 Prediction

Model generates a prediction/result.

📥 Response

API returns the result as a JSON response.

📡 Real-World Example

Request

json

POST /predict
{
  "text": "I love this product"
}

Response

json

{
  "sentiment": "positive"
}

ML Model Serving with FastAPI

Step 1: Save Model

save_model.py

python

import joblib

# After training
joblib.dump(model, "model.pkl")

Step 2: Load and Predict

main.py

python

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

class InputData(BaseModel):
    text: str

@app.post("/predict")
def predict(data: InputData):
    result = model.predict([data.text])
    # .tolist() turns NumPy values into plain Python values that FastAPI can serialize
    return {"prediction": result.tolist()[0]}

ML Pipeline (VERY IMPORTANT)

A sequence of steps: Preprocessing → Model Prediction → Postprocessing.

Pipeline Logic

python

def preprocess(text):
    return text.lower()

def postprocess(pred):
    return "positive" if pred == 1 else "negative"

@app.post("/predict")
def predict(data: InputData):
    processed = preprocess(data.text)
    raw = model.predict([processed])
    final = postprocess(raw[0])
    return {"result": final}

Clean Architecture

Reusability

Easier Debugging
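
The benefits listed above follow from keeping each stage a plain function. The sketch below is one possible way to bundle the stages so they can be exercised without starting the API; KeywordModel and PredictionPipeline are hypothetical names, and the keyword rule only stands in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


class KeywordModel:
    """Hypothetical stand-in for a trained classifier with a predict() method."""

    def predict(self, texts: Sequence[str]) -> List[int]:
        # Returns 1 when the text contains "love", otherwise 0.
        return [1 if "love" in text else 0 for text in texts]


def preprocess(text: str) -> str:
    return text.lower()


def postprocess(pred: int) -> str:
    return "positive" if pred == 1 else "negative"


@dataclass
class PredictionPipeline:
    """Bundles the three stages so each one can be swapped or tested alone."""

    model: KeywordModel
    pre: Callable[[str], str] = preprocess
    post: Callable[[int], str] = postprocess

    def run(self, text: str) -> str:
        processed = self.pre(text)
        raw = self.model.predict([processed])
        return self.post(raw[0])


pipeline = PredictionPipeline(model=KeywordModel())
print(pipeline.run("I LOVE this product"))  # positive
print(pipeline.run("Not for me"))           # negative
```

Because the stages are ordinary callables, a failing prediction can be traced by calling preprocess, the model, and postprocess one at a time.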

Handling Large Models

The Problem & Solution

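Large models are slow to load, so reloading one on every request adds that load time to each response. Load the model once at startup and reuse it for all requests. The snippet below uses the startup event hook; newer FastAPI releases deprecate it in favor of a lifespan handler, which serves the same purpose.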

Events.py

python

@app.on_event("startup")
def load_model():
    global model
    model = joblib.load("model.pkl")
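
For comparison, here is a minimal sketch of the lifespan style. FastAPI accepts an async context manager through the lifespan argument of FastAPI(...); everything before the yield runs once at startup and everything after it runs once at shutdown. To keep the sketch runnable without FastAPI installed, the context manager is driven by hand, and load_model_from_disk and ml_models are hypothetical names standing in for the joblib call and the global used above.

```python
import asyncio
from contextlib import asynccontextmanager

ml_models = {}  # shared registry, filled once at startup


def load_model_from_disk():
    """Hypothetical loader; in this section's code it would be joblib.load("model.pkl")."""
    return lambda texts: ["positive" for _ in texts]


@asynccontextmanager
async def lifespan(app):
    # Runs once, before the server starts accepting requests.
    ml_models["sentiment"] = load_model_from_disk()
    yield
    # Runs once at shutdown: drop the reference so the memory can be freed.
    ml_models.clear()


# With FastAPI installed, the handler is passed at construction time:
#     app = FastAPI(lifespan=lifespan)


async def demo():
    # Drive the context manager by hand to show the load/unload order.
    async with lifespan(app=None):
        loaded = "sentiment" in ml_models
    return loaded, len(ml_models)


print(asyncio.run(demo()))  # (True, 0)
```

Keeping the model in a dictionary rather than a bare global also leaves room for several models or versions to be loaded side by side.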

Async + ML (Important Insight)

⚡ The Sync/Async Rule

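Most ML inference is CPU-bound and blocking, so it gains nothing from async/await. FastAPI runs plain def endpoints in a thread pool, so a blocking model.predict call there does not stall the event loop; the same call inside an async def endpoint does.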

Use Async For

I/O operations (DB, APIs, Files)

Use Sync For

ML Computations & Model Inference
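
When an endpoint has to be async (for example because it also awaits a database call), one common option is to hand the blocking inference to a worker thread. The sketch below uses asyncio.to_thread (Python 3.9+); predict_sync is a hypothetical stand-in for model.predict, with time.sleep simulating the blocking work. Threads only give real parallelism when the inference library releases the GIL, which many native-code libraries do for heavy operations, so treat this as an assumption to verify for your model.

```python
import asyncio
import time


def predict_sync(text: str) -> str:
    """Hypothetical blocking inference call standing in for model.predict."""
    time.sleep(0.2)  # simulate work that blocks the calling thread
    return "positive" if "love" in text.lower() else "negative"


async def predict_endpoint(text: str) -> str:
    # Offload the blocking call to a worker thread so the event loop stays free.
    return await asyncio.to_thread(predict_sync, text)


async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        predict_endpoint("I love this product"),
        predict_endpoint("Terrible experience"),
        predict_endpoint("Love it"),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Calling predict_sync directly inside predict_endpoint would instead run the three requests one after another, because each call would hold the event loop for its full duration.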

Batch Predictions (Optimization)

Sending several inputs in one request lets the model process them in a single vectorized call, which usually gives much higher throughput than predicting one input at a time.

Batch API

python

from typing import List

class BatchInput(BaseModel):
    texts: List[str]

@app.post("/batch-predict")
def batch_predict(data: BatchInput):
    results = model.predict(data.texts)
    return {"results": results.tolist()}
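
An unbounded batch can exhaust memory, so a batch endpoint is often paired with a size cap. The sketch below splits a large request into fixed-size chunks before calling the model. MAX_BATCH_SIZE, chunked, predict_in_batches, and LengthModel are hypothetical names, and the length rule only stands in for a real model.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 32  # hypothetical cap; tune it to the model and hardware


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most size items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class LengthModel:
    """Hypothetical stand-in model: labels texts longer than 10 characters as 1."""

    def __init__(self) -> None:
        self.calls = 0

    def predict(self, texts: Sequence[str]) -> List[int]:
        self.calls += 1  # count model invocations to show the effect of batching
        return [1 if len(t) > 10 else 0 for t in texts]


def predict_in_batches(model: LengthModel, texts: Sequence[str]) -> List[int]:
    results: List[int] = []
    for batch in chunked(texts, MAX_BATCH_SIZE):
        results.extend(model.predict(batch))
    return results


model = LengthModel()
texts = ["short", "a considerably longer text"] * 50  # 100 inputs
predictions = predict_in_batches(model, texts)
print(len(predictions), model.calls)  # 100 4
```

The 100 inputs are served with 4 model calls (32 + 32 + 32 + 4) instead of 100, while no single call exceeds the cap.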

Logging Predictions (DB Integration)

Persistence Logic

python

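from fastapi import Depends

# get_db (session dependency) and Prediction (ORM model) are assumed to be
# defined elsewhere in the project, e.g. with SQLAlchemy.
@app.post("/predict")
def predict(data: InputData, db=Depends(get_db)):
    # .tolist() yields a plain Python value that FastAPI can serialize
    result = model.predict([data.text]).tolist()[0]

    # Store the input and the prediction so they can be audited later
    db_entry = Prediction(input_data=data.text, result=str(result))
    db.add(db_entry)
    db.commit()

    return {"result": result}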
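
The endpoint above assumes a table behind the Prediction model. As a rough illustration of what gets persisted, the sketch below uses the standard-library sqlite3 module in place of an ORM; the predictions table layout, the created_at column, and the helper names are assumptions, not part of the original code.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema mirroring the input_data and result fields used above.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            input_data TEXT NOT NULL,
            result TEXT NOT NULL,
            created_at TEXT NOT NULL
        )
        """
    )


def log_prediction(conn: sqlite3.Connection, input_data: str, result: str) -> int:
    """Insert one prediction row and return its id."""
    cursor = conn.execute(
        "INSERT INTO predictions (input_data, result, created_at) VALUES (?, ?, ?)",
        (input_data, result, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
init_db(conn)
row_id = log_prediction(conn, "I love this product", "positive")
rows = conn.execute("SELECT input_data, result FROM predictions").fetchall()
print(row_id, rows)  # 1 [('I love this product', 'positive')]
```

Storing a timestamp with each row makes it possible to compare prediction distributions across time windows later.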

Versioning Models

Model Files

Track improvements and roll back easily.

model_v1.pkl
model_v2.pkl
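
With files named this way, picking the newest model or rolling back to an older one can be a small lookup. The sketch below assumes the model_vN.pkl naming shown above; list_versions and resolve_model_path are hypothetical helpers, and the files are empty placeholders created in a temporary directory.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Optional

VERSION_PATTERN = re.compile(r"^model_v(\d+)\.pkl$")


def list_versions(model_dir: Path) -> List[int]:
    """Return the version numbers found in model_dir, sorted ascending."""
    versions = []
    for path in model_dir.iterdir():
        match = VERSION_PATTERN.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def resolve_model_path(model_dir: Path, version: Optional[int] = None) -> Path:
    """Return the path for the requested version, or the newest when version is None."""
    versions = list_versions(model_dir)
    if not versions:
        raise FileNotFoundError(f"no model_vN.pkl files in {model_dir}")
    chosen = versions[-1] if version is None else version
    if chosen not in versions:
        raise FileNotFoundError(f"model_v{chosen}.pkl not found in {model_dir}")
    return model_dir / f"model_v{chosen}.pkl"


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    for name in ("model_v1.pkl", "model_v2.pkl", "notes.txt"):
        (model_dir / name).touch()  # empty placeholder files
    latest = resolve_model_path(model_dir).name
    rollback = resolve_model_path(model_dir, version=1).name

print(latest, rollback)  # model_v2.pkl model_v1.pkl
```

Parsing the number instead of sorting file names keeps model_v10.pkl ahead of model_v2.pkl.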

API Versioning

URL Path Versioning

python

@app.post("/v1/predict")
def predict_v1():
    pass

@app.post("/v2/predict")
def predict_v2():
    pass
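
The point of separate paths is that each one stays bound to its own model and response contract, so existing clients keep working when a new version ships. The framework-free sketch below illustrates that binding with a plain routing table; the two predictor functions, their keyword rules, and dispatch are hypothetical stand-ins for the endpoint bodies left empty above.

```python
from typing import Callable, Dict, List

Predictor = Callable[[List[str]], List[str]]


def predict_v1(texts: List[str]) -> List[str]:
    # Hypothetical v1 behavior: two classes only.
    return ["positive" if "love" in t.lower() else "negative" for t in texts]


def predict_v2(texts: List[str]) -> List[str]:
    # Hypothetical v2 behavior: adds a "neutral" class, which would break
    # v1 clients if it silently replaced the old endpoint.
    labels = []
    for t in texts:
        lowered = t.lower()
        if "love" in lowered:
            labels.append("positive")
        elif "hate" in lowered:
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels


ROUTES: Dict[str, Predictor] = {
    "/v1/predict": predict_v1,
    "/v2/predict": predict_v2,
}


def dispatch(path: str, texts: List[str]) -> List[str]:
    """Look up the handler bound to the path and run it."""
    try:
        handler = ROUTES[path]
    except KeyError:
        raise LookupError(f"unknown route: {path}") from None
    return handler(texts)


print(dispatch("/v1/predict", ["It is fine"]))  # ['negative']
print(dispatch("/v2/predict", ["It is fine"]))  # ['neutral']
```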

Dockerization (VERY IMPORTANT)

Containerize your application + dependencies for consistent deployment across environments.

🐳 Dockerfile

Dockerfile

dockerfile

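FROM python:3.10
WORKDIR /app
# Copy the dependency list first so this layer stays cached when only code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]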

📦 Build & Run Commands

Build

bash

docker build -t ml-api .

Run

bash

docker run -p 80:80 ml-api

Deployment & Scaling

Deployment Platforms

AWS (EC2, ECS), Google Cloud (GKE), Azure ML.

Horizontal Scaling

Run multiple instances of the API behind a load balancer.

Tools of Choice

Kubernetes, Docker Swarm, ECS.

The scaling challenge

Every instance loads its own copy of the model, so memory usage grows linearly with the number of instances.

Solution: Shared Model Servers or intelligent Model Caching.
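
A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes each worker process holds one full copy of the model; the 2 GB model size and the replica and worker counts are made-up numbers for illustration.

```python
def total_model_memory_gb(
    model_size_gb: float, replicas: int, workers_per_replica: int
) -> float:
    """Memory spent on model copies when every worker process loads its own."""
    return model_size_gb * replicas * workers_per_replica


MODEL_SIZE_GB = 2.0  # hypothetical model size

per_process = total_model_memory_gb(MODEL_SIZE_GB, replicas=4, workers_per_replica=2)
doubled = total_model_memory_gb(MODEL_SIZE_GB, replicas=8, workers_per_replica=2)
shared_server = MODEL_SIZE_GB * 1  # one copy held by a single shared model server

print(per_process, doubled, shared_server)  # 16.0 32.0 2.0
```

Doubling the replicas doubles the memory spent on model copies, while a shared model server keeps that cost fixed at the price of an extra network hop per prediction.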

Performance & Security

Performance Optimization

• Batch Requests: Optimize CPU/GPU usage.
• Redis Caching: Cache common predictions.
• Reduce Size: Quantization/Pruning.
• Use GPU: For deep learning models.
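
To illustrate the caching idea, the sketch below memoizes predictions in process with functools.lru_cache. It is a stand-in for Redis, not a replacement: an lru_cache is private to one process, whereas Redis can be shared by all replicas. slow_predict, cached_predict, and the keyword rule are hypothetical.

```python
from functools import lru_cache

call_count = 0  # counts how often the underlying model is actually invoked


def slow_predict(text: str) -> str:
    """Hypothetical expensive inference call."""
    global call_count
    call_count += 1
    return "positive" if "love" in text else "negative"


@lru_cache(maxsize=1024)
def cached_predict(normalized_text: str) -> str:
    return slow_predict(normalized_text)


def predict(text: str) -> str:
    # Normalize first so trivially different inputs share one cache entry.
    return cached_predict(text.strip().lower())


for text in ["I love this", "i love this ", "I LOVE THIS", "Not great"]:
    predict(text)

print(call_count, cached_predict.cache_info().hits)  # 2 2
```

Four requests reach the model only twice. Caching like this is safe only while the model is deterministic and unchanged, so the cache should be cleared whenever a new model version is deployed.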

Security & Validation

• Protect Endpoints: JWT authentication.
• Input Validation: Strict request schemas.
• Rate Limiting: Prevent abuse/DDoS.
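
Rate limiting is often implemented with a token bucket: each client has a bucket that refills at a fixed rate, and a request is accepted only if a token is available. The sketch below is a minimal single-process version; the TokenBucket class, its parameters, and the injected clock are assumptions for illustration, and production setups usually rely on a gateway or a shared store instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to capacity, then refill_per_second requests per second."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._now = now  # injectable clock, which keeps the demo deterministic
        self.tokens = float(capacity)
        self.updated_at = now()

    def allow(self) -> bool:
        current = self._now()
        elapsed = current - self.updated_at
        self.updated_at = current
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


fake_time = [0.0]  # manual clock so the example does not need to sleep
bucket = TokenBucket(capacity=3, refill_per_second=1.0, now=lambda: fake_time[0])

burst = [bucket.allow() for _ in range(5)]
fake_time[0] += 2.0  # two seconds pass, so two tokens are refilled
after_wait = [bucket.allow() for _ in range(3)]

print(burst)       # [True, True, True, False, False]
print(after_wait)  # [True, True, False]
```

In an API, a rejected request would typically be answered with HTTP status 429 (Too Many Requests).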

Monitoring & CI/CD

📊

Track Everything

Monitoring prediction distribution and request latency is key to identifying model drift and API saturation early.

Prometheus · Grafana · MLflow · Model Drift · Latency Distribution
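
As a rough picture of what these two signals look like, the sketch below computes latency percentiles and the share of each predicted label from a handful of made-up samples. The data, the baseline share, and the simple share difference used as a drift signal are all assumptions; real drift detection usually relies on proper statistical tests and on metrics exported to a monitoring system.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical samples collected from recent requests.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]
predictions = ["positive"] * 7 + ["negative"] * 3


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def label_shares(labels: List[str]) -> Dict[str, float]:
    """Fraction of predictions per label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
shares = label_shares(predictions)

BASELINE_POSITIVE_SHARE = 0.5  # hypothetical share observed at training time
drift = abs(shares["positive"] - BASELINE_POSITIVE_SHARE)

print(p50, p95)  # 13 250
print(shares["positive"], round(drift, 2))
```

The median looks healthy while the 95th percentile exposes the slow outlier, which is why latency is tracked as a distribution rather than as an average.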

🔄 CI/CD Flow for ML

Train Model → Save Artifact → Build Docker Image → Deploy.

Interview Preparation

🧠 Theory Questions

What is model serving?

Exposing an ML model via an HTTP interface (FastAPI) so other services can consume its predictions.

ML Pipeline steps?

Preprocessing (data cleaning) → Model Prediction (inference) → Postprocessing (formatting result).

Sync vs Async in ML?

CPU-bound tasks (inference) are typically sync; I/O-bound tasks (database queries, external API calls) should be async.

How to scale ML APIs?

Use Load Balancers with Horizontal Scaling (multiple replicas) and shared model caching/servers.

Model versioning importance?

Tracks performance improvements and allows for rapid rollback if a new model fails.

Challenges in ML APIs?

High latency, significant memory usage, version drift, and model-API synchronization.

How do you deploy an ML Model?

Typically using FastAPI + Docker + Cloud Platforms (AWS/GCP) for portability and scaling.

💻 Coding Challenges

Q1: Basic Predict API

Challenge Snippet

python

@app.post("/predict")
def predict(data: InputData):
    # .tolist() converts the NumPy result into JSON-serializable Python values
    return {"res": model.predict([data.text]).tolist()[0]}

Q2: Batch Predict

Challenge Snippet

python

@app.post("/batch")
def batch_predict(data: BatchInput):
    return model.predict(data.texts).tolist()

Q3: Lifecycle Load

Challenge Snippet

python

@app.on_event("startup")
def load():
    global model
    model = joblib.load("model.pkl")