EngineeringNotes
Module 09 - Chapter 01

Vertical vs Horizontal

The fundamental decision in system design. Make it bigger or make more of them?

01

Vertical Scaling

Often called "Scaling Up". You simply upgrade the server (CPU, RAM).

t2.micro → t2.large → m5.24xlarge (96 vCPU)

Same server, just more power.

Pros

  • No code changes required.
  • Simple to manage (1 server).
  • Good for database performance initially.

Cons

  • Hardware Limit (Cannot scale infinitely).
  • Downtime required to upgrade.
  • Single Point of Failure.
Analogy

Like upgrading from a Toyota Corolla engine to a Ferrari engine. It's still one car, just faster.
02

Horizontal Scaling

Often called "Scaling Out". You add more servers to the pool.

Server 1 + Server 2 + Server 3

Many small servers (Commodity Hardware) working together.

Pros

  • Infinite Scaling (theoretically).
  • High Availability (If one dies, others work).
  • No downtime upgrades (Rolling Updates).

Cons

  • Complex ecosystem (Load Balancers required).
  • Data Consistency challenges.
  • Network overhead.
Analogy

Instead of one Ferrari, you buy 10 Toyota Corollas and hire 10 drivers. You can carry 10x more people.
03

When to Choose?

Vertical Scaling

  • Early Stage: Simple to set up, perfect for MVPs.
  • Databases: Often easier to vertically scale a primary DB than to shard it.
  • Monolithic Apps: Legacy apps that can't run on multiple servers.

Horizontal Scaling

  • High Traffic: When you hit the limit of the biggest machine.
  • High Availability: Critical apps that cannot tolerate downtime.
  • Microservices: Naturally fits independent services.
04

The Single-Threaded Bottleneck

Vertical scaling has a hidden trap for single-threaded runtimes. Even if you buy a monster 64-core server, your app might only use 1 core.

Node.js is Single-Threaded

Node.js runs on the Event Loop; JavaScript execution is single-threaded by design.

  • If you have an 8-core CPU, Node.js will use 1 core.
  • The other 7 cores sit IDLE (wasted money).

To fix this, we need special techniques (Cluster Module / Worker Threads).
(More on this in the next section...)
05

Solutions: How to use all Cores?

Attempt 1: Run the same file multiple times

If we have 8 cores, why not just open 8 terminals and run node index.js 8 times?

$ node index.js
Running on PID 4001... Listening on 3000
$ node index.js
Error: listen EADDRINUSE: address already in use :::3000

Advantages

  • Works great for general scripts (image processing, cron jobs).
  • Simple to understand.

Disadvantages

  • Port Conflicts: You can't bind two processes to port 3000.
  • Hard to manage (8 distinct terminal windows?).

Attempt 2: The Cluster Module

Node.js has a built-in module called cluster. It allows you to create a "Primary" process that forks "Worker" processes. Crucially, they can all share the same port!

Note: The isPrimary check ensures only the parent process calls fork(). Workers skip this block, preventing infinite recursive creation of processes.

index.js runtime structure

// index.js — the Primary forks workers; all workers share port 3000
const cluster = require('node:cluster');
const os = require('node:os');
const http = require('node:http');

if (cluster.isPrimary) {
  // Manager logic: fork one worker per CPU core
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  // Worker logic: business logic lives here; every worker calls listen(3000)
  http.createServer((req, res) => {
    res.end(`Handled by worker PID ${process.pid}`);
  }).listen(3000);
}

Advantages

  • No Port Conflicts: All workers share port 3000 seamlessly. (The primary process owns the listening socket and distributes incoming connections to the workers, so they never fight for the port.)
  • Full CPU Utilization: Uses all cores of your machine.
  • Resilient: If one worker crashes, others stay alive.

Disadvantages

  • Sticky Browser Connections: Browsers keep connections alive (Keep-Alive), so
    refreshing the page might reuse the same worker PID. (Test with Incognito, curl, or Postman to see the balancing.)
  • Complexity: Code is harder to read/debug.
  • State Management: Workers don't share memory (variables).
06

Capacity Estimation

Before we talk about complex architectures, we need to do the Paper Math. How many servers do you actually need?

The System Design Interview

In almost every system design interview, you will be asked:

  • How would you scale your server?
  • How do you handle sudden traffic spikes?
  • How can you support a certain SLA (Service Level Agreement) given 1M users?

The Strategy: 4 Steps to Scale

1
Paper Math
Estimate users and traffic volume (e.g., 1M DAU).
2
Estimate RPS
Calculate Requests Per Second needed.
3
Benchmark
Test how much 1 single machine can handle.
4
Autoscale
Configure machines to scale based on CPU/RAM load.

Example: PayTM Flash Sale

Scenario: 8:00 PM Flash Sale.

  • Normal Traffic: 1,000 req/s
  • Spike Traffic: 100,000 req/s
  • The Problem: Traffic spikes in seconds, but servers take ~5 minutes to boot.

[Chart: requests/sec vs server capacity. Traffic spikes far above server capacity: CRASH 💥]

The Solution: Autoscaling Logic

  1. Monitor: Watch metrics (CPU/RPS).
  2. Alarm: Threshold crossed (e.g. > 80% CPU).
  3. Decision: ASG Policy says "Add +3 Servers".
  4. Action: New VMs boot up and register with Load Balancer.
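The decision step above can be sketched as a pure function. The 80% threshold and the "+3 servers" policy come from the example; the scale-in branch is an assumed addition:

```javascript
// Minimal sketch of the ASG decision step (step 3 above).
// Threshold (80% CPU) and the "+3 servers" policy follow the example;
// the scale-in branch is an assumed extra.
function autoscaleTick(metrics, currentServers) {
  if (metrics.cpu > 0.8) return currentServers + 3;                      // alarm fired: add +3
  if (metrics.cpu < 0.2 && currentServers > 1) return currentServers - 1; // idle: scale in
  return currentServers;                                                 // within bounds: no change
}

console.log(autoscaleTick({ cpu: 0.9 }, 4)); // 7
```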
Autoscaling Decision Flow

[Diagram: monitoring agents on each AWS machine stream metrics (aggregate: 1,000 req/s) to the Decision Engine (ASG); if load > 80%, it answers "YES: Add +3 Servers"]

Critical: When RPS Fails

Metrics like req/s are great for simple apps (PayTM, Twitter). But they fail miserably for Heavy Compute Jobs.

The Problem: Use Case

Example: Fab.AI or YouTube Studio. A single request (e.g., "Generate 4K Video") might take 30 seconds.

Why RPS Fails

You might only have 0.1 req/s, but your server is at 100% CPU. If you scaled on RPS, you wouldn't scale at all.

The Solution: Better Metrics

  • CPU/GPU Utilization: Scale if utilization > 70%.
  • Queue Depth: If there are > 5 jobs pending, add a worker node.
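The two metrics above can be combined into a single scale-out check. A minimal sketch, using the thresholds from the bullets (70% utilization, 5 pending jobs); the function name is illustrative:

```javascript
// Sketch: add a worker when compute is saturated OR the backlog grows,
// using the thresholds from the bullets above.
function shouldAddWorker({ gpuUtilization, pendingJobs }) {
  return gpuUtilization > 0.7 || pendingJobs > 5;
}

console.log(shouldAddWorker({ gpuUtilization: 0.3, pendingJobs: 12 })); // true
```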
Estimation Formula (Back-of-Envelope)

Questions to ask:
1. What is the Read/Write ratio? (Reads are cheap, Writes are hard)
2. Average Request Size? (1KB vs 10MB)
3. Desired Latency? (Do we need to reply in 50ms?)

Capacity = (Traffic * Request_Weight) / Single_Server_Limit
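The paper math can be sketched as a small calculator. All the numbers here are illustrative assumptions (1M DAU, 20 requests/user/day, a 5x peak multiplier, and a 500 req/s single-server benchmark):

```javascript
// Back-of-envelope estimate: how many servers for 1M DAU?
// Every number below is an illustrative assumption.
function estimateServers({ dau, requestsPerUserPerDay, peakMultiplier, singleServerRps }) {
  const avgRps = (dau * requestsPerUserPerDay) / 86_400; // 86,400 seconds per day
  const peakRps = avgRps * peakMultiplier;               // size for spikes, not averages
  return { avgRps, peakRps, servers: Math.ceil(peakRps / singleServerRps) };
}

const est = estimateServers({
  dau: 1_000_000,            // Paper Math: 1M daily active users
  requestsPerUserPerDay: 20, // assumed traffic per user
  peakMultiplier: 5,         // assumed peak-to-average ratio
  singleServerRps: 500,      // from benchmarking one machine
});
console.log(est.servers); // 3
```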
07

Resource Intensive Architectures

The "Heavy Compute" Problem

Horizontal scaling works for simple requests (get user, save post). But if a request takes 30 seconds (e.g., video transcoding, executing user code), your server hits its connection limit instantly.
Autoscaling cannot save you here. It responds too slowly.

Architecture: Queue vs Direct
  1. User: sends the request.
  2. API Layer: accepts the job instantly ("Job Accepted, job #101") and returns a processing signal to the user.
  3. Message Queue (Redis/SQS): jobs (#101, #102, #103) wait in a Redis list (the buffer zone).
  4. Workers (W-1, W-2, ...): process jobs asynchronously, at their own pace.
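A minimal in-process sketch of this flow, with a plain array standing in for the Redis list (job ids and field names are illustrative):

```javascript
// Sketch: queue-based architecture; an array stands in for Redis/SQS.
const queue = [];
let nextId = 100;

// API layer: accept instantly, return a job id the user can poll
function submitJob(payload) {
  const job = { id: ++nextId, payload, status: 'pending' };
  queue.push(job); // buffered in the queue, not executed yet
  return { accepted: true, jobId: job.id };
}

// Worker: pulls one job per tick, at its own pace
function workerTick() {
  const job = queue.shift();
  if (!job) return null; // nothing pending
  job.status = 'done';   // the heavy work (transcoding etc.) would happen here
  return job;
}

console.log(submitJob('generate 4K video')); // { accepted: true, jobId: 101 }
console.log(workerTick().status);            // 'done'
```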

Why Autoscaling Fails

  1. Connection Bloat: If 1,000 users send 10-minute requests, you need 1,000 active HTTP connections. Browsers will time out.
  2. Wasted Resources: If you spin up a server for a 10s job, boot time (30s) > execution time.
  3. Loss of Control: If 1M users click "Generate", you crash. With a queue, you just have 1M items in Redis, and your workers run peacefully at their own pace.

Why Queues Win

  1. Decoupling: The API responds instantly ("Job Accepted"). The user can poll for status or get a webhook.
  2. Cost Efficiency: You can keep 10 workers for 1M jobs. It will just take longer, but it won't crash or cost $1M.
  3. Graceful Shutdown: Workers can finish their current video render before shutting down, unlike a web server that might be killed by a scale-down event.

What's Next?

We have optimized our single server to use all its CPU cores (Vertical Scaling). But eventually, one server is not enough. You will need 10, 50, or 100 servers.

How do you distribute traffic to them? How do you ensure users don't get sent to a dead server?

For that, we need a Load Balancer.

08

Review: Database Scaling

Stateless app servers are easy to scale. Databases (Stateful) are hard.

Problem: Data Consistency

If you just copy the database to 3 servers:
1. User A writes to DB Server 1.
2. User B reads from DB Server 2.
3. User B might not see User A's data yet! (Replication Lag)
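A toy in-memory simulation of this race. The Maps and the manual replicate step are stand-ins; real replication is asynchronous network traffic:

```javascript
// Toy model: primary and replica as two Maps; replication is a manual step.
const primary = new Map();
const replica = new Map();

const write = (key, value) => primary.set(key, value); // User A writes to Server 1
const readReplica = (key) => replica.get(key);         // User B reads from Server 2
const replicate = () => { for (const [k, v] of primary) replica.set(k, v); };

write('post:1', 'hello');
console.log(readReplica('post:1')); // undefined (replication lag!)
replicate();
console.log(readReplica('post:1')); // 'hello'
```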

1. Read Replicas (Master-Slave)

One Primary node handles all WRITES. Multiple Replica nodes handle READS.
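The read/write split typically lives in the application's data layer. A sketch (server names and the round-robin choice are assumptions):

```javascript
// Sketch: route WRITES to the primary, READS round-robin across replicas.
const primaryDb = { name: 'db-primary' };
const replicaDbs = [{ name: 'db-replica-1' }, { name: 'db-replica-2' }];
let next = 0;

function route(queryType) {
  if (queryType === 'write') return primaryDb;    // all writes hit one node
  return replicaDbs[next++ % replicaDbs.length];  // reads spread over replicas
}
```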

2. Sharding

Splitting data across multiple servers (e.g., Users A-M on Server 1, N-Z on Server 2).
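The A-M / N-Z split can be sketched as a shard-routing function (range-based, as in the example; hashing the key is the common alternative, and the server names are illustrative):

```javascript
// Sketch: range-based sharding on the first letter of the username.
function shardFor(username) {
  const first = username[0].toUpperCase();
  return first <= 'M' ? 'server-1' : 'server-2'; // A-M vs N-Z
}

console.log(shardFor('alice')); // 'server-1'
console.log(shardFor('zoe'));   // 'server-2'
```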