ML API Development & Deployment
Master the art of serving, scaling, and operationalizing machine learning models with high-performance FastAPI.
What is an ML API?
An ML API exposes a trained machine learning model via an HTTP interface so other applications can use it for predictions.
📡 Real-World Example
Request:
POST /predict
{
  "text": "I love this product"
}
Response:
{
  "sentiment": "positive"
}
ML Model Serving with FastAPI
Step 1: Save Model
import joblib

# After training
joblib.dump(model, "model.pkl")
Step 2: Load and Predict
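from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once, when the module is imported

class InputData(BaseModel):
    text: str

@app.post("/predict")
def predict(data: InputData):
    result = model.predict([data.text])
    # predict() typically returns a NumPy array; .tolist() converts it to
    # plain Python values that FastAPI can serialize to JSON
    return {"prediction": result.tolist()[0]}
ML Pipeline (VERY IMPORTANT)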
A sequence of steps: Preprocessing → Model Prediction → Postprocessing.
def preprocess(text):
    return text.lower()

def postprocess(pred):
    return "positive" if pred == 1 else "negative"

@app.post("/predict")
def predict(data: InputData):
    processed = preprocess(data.text)
    raw = model.predict([processed])
    final = postprocess(raw[0])
    return {"result": final}
Handling Large Models
The Problem & Solution
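Large models are slow to load, so reloading one on every request adds that load time to each response. Load the model once at startup instead. An on_event("startup") hook such as load_model does this; recent FastAPI releases deprecate on_event in favor of a lifespan handler passed as FastAPI(lifespan=...), which serves the same purpose.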
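A minimal sketch of the lifespan form, assuming a hypothetical load_model() stand-in for joblib.load("model.pkl") and a hypothetical ml_models dict as the holder. It uses only the standard library so the load and unload order can be checked without starting a server; with FastAPI the handler would be passed as FastAPI(lifespan=lifespan).

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical holder for objects that should live as long as the process.
ml_models = {}

def load_model():
    # Hypothetical stand-in for joblib.load("model.pkl").
    return lambda texts: [len(t) for t in texts]

@asynccontextmanager
async def lifespan(app):
    # Code before `yield` runs once, before the app starts serving requests.
    ml_models["sentiment"] = load_model()
    yield
    # Code after `yield` runs once, at shutdown.
    ml_models.clear()

# With FastAPI this would be wired up as: app = FastAPI(lifespan=lifespan)

async def demo():
    # Drive the context manager by hand to show the load/unload order.
    async with lifespan(app=None):
        loaded_during = "sentiment" in ml_models
    return loaded_during, "sentiment" in ml_models

loaded_during, loaded_after = asyncio.run(demo())
```

Endpoints would then read the model from ml_models instead of relying on a global that is reassigned at startup.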
@app.on_event("startup")
def load_model():
    global model
    model = joblib.load("model.pkl")
Async + ML (Important Insight)
⚡ The Sync/Async Rule
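Most ML libraries are CPU-bound and expose blocking calls, not awaitable ones. Calling model.predict directly inside an async def endpoint blocks the event loop for the whole prediction, so either declare prediction endpoints with plain def (FastAPI runs those in a thread pool) or offload the call to a worker thread or process.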
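One way to offload a blocking prediction from async code is asyncio.to_thread (Python 3.9+). The sketch below assumes a hypothetical slow_predict function standing in for model.predict. Threads keep the event loop responsive, but they only run predictions in parallel when the underlying library releases the GIL; otherwise a process pool may be the better fit.

```python
import asyncio
import time

def slow_predict(text):
    # Hypothetical stand-in for a blocking model.predict call.
    time.sleep(0.05)
    return "positive" if "love" in text.lower() else "negative"

async def predict_async(text):
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_predict, text)

async def main():
    texts = ["I love this product", "Not for me"]
    # gather() preserves input order in its result list.
    return await asyncio.gather(*(predict_async(t) for t in texts))

results = asyncio.run(main())
```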
Batch Predictions (Optimization)
Instead of predicting one-by-one, batching multiple inputs significantly improves throughput.
from typing import List

class BatchInput(BaseModel):
    texts: List[str]

@app.post("/batch-predict")
def batch_predict(data: BatchInput):
    # One vectorized call instead of one predict() call per text
    results = model.predict(data.texts)
    return {"results": results.tolist()}
Logging Predictions (DB Integration)
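from fastapi import Depends

# get_db (a session dependency) and Prediction (an ORM model) are assumed
# to be defined elsewhere in the application
@app.post("/predict")
def predict(data: InputData, db=Depends(get_db)):
    # .tolist() yields a plain Python value that can be serialized to JSON
    result = model.predict([data.text]).tolist()[0]
    db_entry = Prediction(input_data=data.text, result=str(result))
    db.add(db_entry)
    db.commit()
    return {"result": result}
Versioning Models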
Model Files
Save each trained model under its own versioned filename so you can track improvements and roll back easily.
model_v1.pkl
model_v2.pkl
API Versioning
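Versioned routes usually map onto the versioned model files. A minimal sketch of that mapping, assuming a hypothetical MODEL_FILES registry and DEFAULT_VERSION setting (neither comes from FastAPI or joblib):

```python
# Hypothetical registry: API version -> model artifact on disk.
MODEL_FILES = {"v1": "model_v1.pkl", "v2": "model_v2.pkl"}

# Hypothetical setting: the version served when the caller does not pin one.
DEFAULT_VERSION = "v2"

def resolve_model_file(version=None):
    """Return the artifact for a version, falling back to the default."""
    chosen = version or DEFAULT_VERSION
    if chosen not in MODEL_FILES:
        raise KeyError(f"unknown model version: {chosen}")
    return MODEL_FILES[chosen]

current = resolve_model_file()
pinned = resolve_model_file("v1")
```

Under this layout, rolling back amounts to setting DEFAULT_VERSION back to "v1" while both files stay deployed.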
@app.post("/v1/predict")
def predict_v1():
    pass

@app.post("/v2/predict")
def predict_v2():
    pass
Dockerization (VERY IMPORTANT)
Containerize your application together with its dependencies so it deploys consistently across environments.
🐳 Dockerfile
FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
📦 Build & Run Commands
docker build -t ml-api .
docker run -p 80:80 ml-api
Deployment & Scaling
Deployment Platforms
AWS (EC2, ECS), Google Cloud (GKE), Azure ML.
Horizontal Scaling
Run multiple instances behind a load balancer.
Tools of Choice
Kubernetes, Docker Swarm, ECS.
The scaling challenge
Every instance (and every worker process) loads its own copy of the model, so memory usage grows linearly with the number of replicas.
Solution: Shared Model Servers or intelligent Model Caching.
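Model caching inside a single process can be sketched with functools.lru_cache. The get_model loader and load_count counter below are hypothetical; a real loader might call joblib.load on a versioned file. Note that this only avoids repeated loads within one process and bounds how many models stay resident. It does not share memory across processes, which is what a shared model server addresses.

```python
from functools import lru_cache

load_count = 0  # counts real loads, for illustration only

@lru_cache(maxsize=2)  # keep at most two models resident per process
def get_model(version):
    # Hypothetical loader; a real one might call joblib.load(f"model_{version}.pkl").
    global load_count
    load_count += 1
    return {"version": version}

a = get_model("v1")
b = get_model("v1")  # served from the cache, no second load
c = get_model("v2")
```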
Performance & Security
Performance Optimization
- Batch Requests: Optimize CPU/GPU usage.
- Redis Caching: Cache common predictions.
- Reduce Size: Quantization/Pruning.
- Use GPU: For deep learning models.
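The prediction-caching idea from the list above can be sketched without a Redis server. PredictionCache below is a hypothetical in-memory stand-in for a key/value store, and fake_predict stands in for the model. Normalizing the text before hashing assumes the model treats inputs case-insensitively, as the lowercase preprocessing step does.

```python
import hashlib

class PredictionCache:
    """In-memory stand-in for a Redis-style key/value cache (hypothetical)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(text):
        # Normalize so trivially different inputs share one cache entry.
        normalized = text.strip().lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, predict_fn):
        key = self.key_for(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = predict_fn(text)
        self._store[key] = value
        return value

def fake_predict(text):
    # Hypothetical stand-in for model.predict on a single text.
    return "positive" if "love" in text.lower() else "negative"

cache = PredictionCache()
first = cache.get_or_compute("I love this product", fake_predict)
second = cache.get_or_compute("  i LOVE this product ", fake_predict)
```

With a real Redis deployment, the dict lookups would become get and set calls against the server, usually with an expiry so stale predictions age out after a model update.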
Security & Validation
- Protect Endpoints: JWT authentication.
- Input Validation: Strict request schemas.
- Rate Limiting: Prevent abuse/DDoS.
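Rate limiting from the list above is often implemented as a token bucket. The TokenBucket class below is a hypothetical, single-process sketch with an injectable clock so its behavior can be checked deterministically. In an API, one bucket would typically be kept per client (for example per API key), and a rejected call would map to an HTTP 429 response.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        # Add the tokens earned since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A fake clock makes the demo deterministic.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_second=1, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]  # third call exceeds the burst size
fake_now[0] += 1.0                          # one second later, one token is back
after_wait = bucket.allow()
```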
Monitoring & CI/CD
Track Everything
Monitoring prediction distribution and request latency is key to identifying model drift and API saturation early.
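A minimal in-process sketch of both signals, assuming a hypothetical PredictionMonitor class and a fake_predict stand-in for the model. A production setup would more likely export these numbers to a metrics system than keep them in memory.

```python
import time
from collections import Counter

class PredictionMonitor:
    """Tracks label counts and request latencies (hypothetical helper)."""

    def __init__(self):
        self.label_counts = Counter()
        self.latencies = []

    def record(self, label, latency_seconds):
        self.label_counts[label] += 1
        self.latencies.append(latency_seconds)

    def distribution(self):
        # Share of each predicted label; a shift over time can hint at drift.
        total = sum(self.label_counts.values())
        if total == 0:
            return {}
        return {label: count / total for label, count in self.label_counts.items()}

    def p95_latency(self):
        # Nearest-rank 95th percentile, using integer arithmetic for the rank.
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

def timed_predict(monitor, predict_fn, text):
    # Wrap a prediction call and record its label and wall-clock latency.
    start = time.perf_counter()
    label = predict_fn(text)
    monitor.record(label, time.perf_counter() - start)
    return label

def fake_predict(text):
    # Hypothetical stand-in for the model.
    return "positive" if "love" in text.lower() else "negative"

monitor = PredictionMonitor()
for text in ["I love it", "love this", "love love", "meh"]:
    timed_predict(monitor, fake_predict, text)

shares = monitor.distribution()
```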
🔄 CI/CD Flow for ML
Train Model → Save Artifact → Build Docker Image → Deploy.
Interview Preparation
🧠 Theory Questions
What is model serving?
ML Pipeline steps?
Sync vs Async in ML?
How to scale ML APIs?
Model versioning importance?
Challenges in ML APIs?
How do you deploy an ML Model?
💻 Coding Challenges
@app.post("/predict")
def predict(data: InputData):
    return {"res": model.predict([data.text])[0]}

@app.post("/batch")
def batch_predict(data: BatchInput):
    return model.predict(data.texts).tolist()

@app.on_event("startup")
def load():
    global model
    model = joblib.load("model.pkl")