Vertical vs Horizontal
The fundamental decision in system design. Make it bigger or make more of them?
Vertical Scaling
Often called "Scaling Up". You simply upgrade the server (CPU, RAM).
Same server, just more power.
Pros
- ✔ No code changes required.
- ✔ Simple to manage (1 server).
- ✔ Good for database performance initially.
Cons
- ✖ Hardware Limit (Cannot scale infinitely).
- ✖ Downtime required to upgrade.
- ✖ Single Point of Failure.
Like upgrading from a Toyota Corolla engine to a Ferrari engine. It's still one car, just faster.
Horizontal Scaling
Often called "Scaling Out". You add more servers to the pool.
Many small servers (Commodity Hardware) working together.
Pros
- ✔ Infinite Scaling (theoretically).
- ✔ High Availability (If one dies, others work).
- ✔ No downtime upgrades (Rolling Updates).
Cons
- ✖ Complex ecosystem (Load Balancers required).
- ✖ Data Consistency challenges.
- ✖ Network overhead.
Instead of one Ferrari, you buy 10 Toyota Corollas and hire 10 drivers. You can carry 10x more people.
When to Choose?
Vertical Scaling
- Early Stage: Simple to set up, perfect for MVPs.
- Databases: Often easier to vertically scale a primary DB than to shard it.
- Monolithic Apps: Legacy apps that can't run on multiple servers.
Horizontal Scaling
- High Traffic: When you hit the limit of the biggest machine.
- High Availability: Critical apps that cannot tolerate downtime.
- Microservices: Naturally fits independent services.
The Single-Threaded Bottleneck
Vertical scaling has a hidden trap for single-threaded runtimes (e.g., Node.js, or Python under the GIL). Even if you buy a monster 64-core server, your app might only use 1 core.
Solutions: How to use all Cores?
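One common answer is to run one worker process per core. A minimal Python sketch, where `handle_request` is a hypothetical CPU-bound handler standing in for your real workload:

```python
import multiprocessing as mp
import os

def handle_request(n: int) -> int:
    # Hypothetical CPU-bound work (e.g., hashing, image resizing).
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # One process per core sidesteps the single-thread limit:
    # each worker gets its own interpreter and its own core.
    with mp.Pool(processes=os.cpu_count()) as pool:
        results = pool.map(handle_request, [10_000] * 8)
    print(len(results))  # 8 jobs completed, spread across all cores
```

Production servers apply the same idea: gunicorn's `--workers` flag or Node's `cluster` module fork one process per core in front of a shared port.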
Capacity Estimation
Before we talk about complex architectures, we need to do the Paper Math. How many servers do you actually need?
The System Design Interview
In almost every system design interview, you will be asked:
- How would you scale your server?
- How do you handle sudden traffic spikes?
- How can you support a certain SLA (Service Level Agreement) given 1M users?
The Strategy: 4 Steps to Scale
Example: PayTM Flash Sale
Scenario: 8:00 PM Flash Sale.
- Normal Traffic: 1,000 req/s
- Spike Traffic: 100,000 req/s
- The Problem: Traffic spikes in seconds, but servers take 5 minutes to boot.
The Solution: Autoscaling Logic
- Monitor: Watch metrics (CPU/RPS).
- Alarm: Threshold crossed (e.g. > 80% CPU).
- Decision: The ASG (Auto Scaling Group) policy says "Add 3 servers".
- Action: New VMs boot up and register with Load Balancer.
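The Monitor → Alarm → Decision steps above can be sketched as a simple threshold policy. All the numbers here (80% threshold, +3 step, cap of 20) are illustrative, not cloud-provider defaults:

```python
def scaling_decision(cpu_percent: float, current_servers: int,
                     threshold: float = 80.0, step: int = 3,
                     max_servers: int = 20) -> int:
    """Toy ASG policy: add `step` servers when CPU crosses the
    threshold, never exceeding `max_servers`."""
    if cpu_percent > threshold:
        return min(current_servers + step, max_servers)
    return current_servers

print(scaling_decision(cpu_percent=92.0, current_servers=4))  # 7 (scale out)
print(scaling_decision(cpu_percent=35.0, current_servers=4))  # 4 (no change)
```

Real ASGs also add a cooldown period after each action, so a single spike doesn't trigger the alarm repeatedly while the new VMs are still booting.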
Critical: When RPS Fails
Metrics like req/s are great for simple apps (PayTM, Twitter). But they fail miserably for Heavy Compute Jobs.
A single request (e.g., "Generate 4K Video") might take 30 seconds.
- ✓ CPU/GPU Utilization: Scale if utilization > 70%.
- ✓ Queue Depth: If there are > 5 jobs pending, add a worker node.
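Queue-depth scaling can be expressed in a few lines. A hedged sketch, reusing the 5-jobs-per-worker rule of thumb from above (tune it for your actual job duration):

```python
import math

def workers_needed(queue_depth: int, current_workers: int,
                   jobs_per_worker: int = 5) -> int:
    """Scale on backlog, not req/s: target one worker per
    `jobs_per_worker` pending jobs, never scaling below current."""
    target = math.ceil(queue_depth / jobs_per_worker)
    return max(current_workers, target)

print(workers_needed(queue_depth=42, current_workers=3))  # 9
print(workers_needed(queue_depth=4, current_workers=3))   # 3
```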
Questions to ask:
1. What is the Read/Write ratio? (Reads are cheap, Writes are hard)
2. Average Request Size? (1KB vs 10MB)
3. Desired Latency? (Do we need to reply in 50ms?)
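Turning these questions into paper math, as a hedged sketch (every number below is hypothetical):

```python
import math

def servers_needed(traffic_rps: int, request_weight: float,
                   single_server_limit: int) -> int:
    """Capacity = (Traffic * Request_Weight) / Single_Server_Limit,
    rounded up. A weight > 1 models heavy requests (10MB vs 1KB)."""
    return math.ceil(traffic_rps * request_weight / single_server_limit)

# Hypothetical: 100,000 req/s peak, requests 1.5x heavier than baseline,
# and one server handles 1,000 baseline requests per second.
print(servers_needed(100_000, 1.5, 1_000))  # 150
```

In practice you would then add headroom (e.g., plan for servers running at ~70% of their limit) so a small spike doesn't immediately saturate the fleet.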
Capacity = (Traffic * Request_Weight) / Single_Server_Limit
Resource Intensive Architectures
The "Heavy Compute" Problem
Horizontal scaling works for simple requests (get user, save post). But if a request takes 30 seconds (e.g., video transcoding, executing user code), your server hits its connection limit instantly.
Autoscaling cannot save you here. It responds too slowly.
Why Autoscaling Fails
- 1. Connection Bloat: If 1,000 users send 10-minute requests, you need 1,000 active HTTP connections. Browsers will time out.
- 2. Wasted Resources: If you spin up a server for a 10s job, the boot time (30s) exceeds the execution time.
- 3. Loss of Control: If 1M users click "Generate", you crash. With a queue, you simply have 1M items in Redis, and your workers run peacefully at their own pace.
Why Queues Win
- 1. Decoupling: The API responds instantly ("Job Accepted"). The user can poll for status or receive a webhook.
- 2. Cost Efficiency: You can keep just 10 workers for 1M jobs. It will take longer, but it won't crash or cost $1M.
- 3. Graceful Shutdown: A worker can finish its current video render before shutting down, unlike a web server that might be killed mid-request by a scale-down event.
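The decoupling pattern fits in a short sketch. Here Python's in-process `queue.Queue` stands in for Redis (the mechanics are the same), and appending to a list stands in for a 30-second render:

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for Redis; API and workers only share this
done = []

def api_submit(job_id: str) -> str:
    jobs.put(job_id)       # enqueue and return immediately...
    return "Job Accepted"  # ...the client polls for status later

def worker():
    while True:
        job = jobs.get()
        if job is None:    # shutdown sentinel: finish cleanly, then stop
            break
        done.append(job)   # pretend this is a 30s video render

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    print(api_submit(f"video-{i}"))  # prints "Job Accepted" five times
jobs.put(None)   # graceful shutdown: worker drains the queue first
t.join()
print(len(done))  # 5
```

Note the shutdown path: the `None` sentinel goes in *behind* the real jobs, so the worker finishes everything already queued before exiting. That is the "graceful shutdown" property autoscaled web servers lack.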
What's Next?
We have optimized our single server to use all its CPU cores (Vertical Scaling). But eventually, one server is not enough. You will need 10, 50, or 100 servers.
How do you distribute traffic to them? How do you ensure users don't get sent to a dead server?
For that, we need a Load Balancer.
Review: Database Scaling
Stateless app servers are easy to scale. Databases (Stateful) are hard.
If you just copy the database to 3 servers:
1. User A writes to DB Server 1.
2. User B reads from DB Server 2.
3. User B might not see User A's data yet! (Replication Lag)
1. Read Replicas (Master-Slave)
One Primary node handles all WRITES. Multiple Replica nodes handle READS.
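The read/write split usually lives in a small router in front of the connection pool. A hedged sketch with hypothetical host names:

```python
import random

PRIMARY = "db-primary:5432"                          # hypothetical hosts
REPLICAS = ["db-replica-1:5432", "db-replica-2:5432"]

def pick_node(sql: str) -> str:
    """Writes go to the primary; reads are spread over replicas.
    Note: this ignores replication lag, so a fresh write may not
    be visible on the replica you read from next."""
    is_write = sql.lstrip().upper().startswith(
        ("INSERT", "UPDATE", "DELETE"))
    return PRIMARY if is_write else random.choice(REPLICAS)

print(pick_node("INSERT INTO users VALUES (1)"))      # db-primary:5432
print(pick_node("SELECT * FROM users") in REPLICAS)   # True
```

Libraries and proxies (e.g., ProxySQL, or ORM-level routers) implement this same split, often with a "read your own writes" escape hatch that pins a session to the primary briefly after a write.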
2. Sharding
Splitting data across multiple servers (e.g., Users A-M on Server 1, N-Z on Server 2).
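The A-M / N-Z example above is range-based sharding, which is one routing function. A minimal sketch (shard names are hypothetical):

```python
def shard_for(username: str) -> str:
    """Range sharding from the example: usernames A-M live on
    shard 1, N-Z on shard 2. Real systems usually hash the key
    instead, because ranges create hot spots (far more 'S' names
    than 'X' names)."""
    first = username[0].upper()
    return "db-shard-1" if "A" <= first <= "M" else "db-shard-2"

print(shard_for("alice"))  # db-shard-1
print(shard_for("zara"))   # db-shard-2
```

The hard part isn't the routing function; it's everything it breaks: cross-shard JOINs, transactions spanning shards, and re-balancing data when you add a third server.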