EngineeringNotes
Module 09 - Chapter 01

Vertical vs Horizontal

The fundamental decision in system design. Make it bigger or make more of them?

01

Vertical Scaling

Often called "Scaling Up". You simply upgrade the server (CPU, RAM).

t2.micro → t2.large → m5.24xlarge (96 vCPU)

Same server, just more power.

Pros

  • No code changes required.
  • Simple to manage (1 server).
  • Good for database performance initially.

Cons

  • Hardware Limit (Cannot scale infinitely).
  • Downtime required to upgrade.
  • Single Point of Failure.
Analogy

Like upgrading from a Toyota Corolla engine to a Ferrari engine. It's still one car, just faster.
02

Horizontal Scaling

Often called "Scaling Out". You add more servers to the pool.

Server 1 + Server 2 + Server 3

Many small servers (Commodity Hardware) working together.

Pros

  • Infinite Scaling (theoretically).
  • High Availability (If one dies, others work).
  • No downtime upgrades (Rolling Updates).

Cons

  • Complex ecosystem (Load Balancers required).
  • Data Consistency challenges.
  • Network overhead.
Analogy

Instead of one Ferrari, you buy 10 Toyota Corollas and hire 10 drivers. You can carry 10x more people.
03

When to Choose?

Vertical Scaling

  • Early Stage: Simple to set up, perfect for MVPs.
  • Databases: Often easier to vertically scale a primary DB than to shard it.
  • Monolithic Apps: Legacy apps that can't run on multiple servers.

Horizontal Scaling

  • High Traffic: When you hit the limit of the biggest machine.
  • High Availability: Critical apps that cannot tolerate downtime.
  • Microservices: Naturally fits independent services.
04

The Single-Threaded Bottleneck

Vertical scaling has a hidden trap for single-threaded runtimes. Even if you buy a monster 64-core server, your app might only use 1 core.

Node.js is Single-Threaded

Node.js runs on the Event Loop; JavaScript execution is single-threaded by design.

  • If you have an 8-core CPU, Node.js will use 1 core.
  • The other 7 cores sit IDLE (wasted money).

To fix this, we need special techniques (Cluster Module / Worker Threads).
(More on this in the next section...)
05

Solutions: How to use all Cores?

Attempt 1: Run the same file multiple times

If we have 8 cores, why not just open 8 terminals and run node index.js 8 times?

$ node index.js
Running on PID 4001... Listening on 3000
$ node index.js
Error: listen EADDRINUSE: address already in use :::3000

Advantages

  • Works great for general scripts (image processing, cron jobs).
  • Simple to understand.

Disadvantages

  • Port Conflicts: You can't bind two processes to port 3000.
  • Hard to manage (8 distinct terminal windows?).

Attempt 2: The Cluster Module

Node.js has a built-in module called cluster. It allows you to create a "Primary" process that forks "Worker" processes. Crucially, they can all share the same port!

Note: The isPrimary check ensures only the parent process calls fork(). Workers skip this block, preventing infinite recursive creation of processes.

index.js runtime structure

// index.js — the Primary forks workers; all workers share port 3000
const cluster = require('node:cluster');
const os = require('node:os');
const http = require('node:http');

if (cluster.isPrimary) {
  // Manager logic: fork one worker per CPU core
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  // Worker logic: business logic lives here; every worker calls listen(3000)
  http.createServer((req, res) => {
    res.end(`Handled by worker PID ${process.pid}`);
  }).listen(3000);
}

Advantages

  • No Port Conflicts: All workers share port 3000 seamlessly. (The primary process owns the listening socket and distributes incoming connections to the workers, so they never fight for the port.)
  • Full CPU Utilization: Uses all cores of your machine.
  • Resilient: If one worker crashes, others stay alive.

Disadvantages

  • Sticky Browser Connections: Browsers keep connections alive (Keep-Alive), so
    refreshing the page might reuse the same worker PID. (Test with Incognito, curl, or Postman to see the balancing.)
  • Complexity: Code is harder to read/debug.
  • State Management: Workers don't share memory (variables).
06

Capacity Estimation

Before we talk about complex architectures, we need to do the Paper Math. How many servers do you actually need?

The System Design Interview

In almost every system design interview, you will be asked:

  • How would you scale your server?
  • How do you handle sudden traffic spikes?
  • How can you support a certain SLA (Service Level Agreement) given 1M users?

The Strategy: 4 Steps to Scale

1
Paper Math
Estimate users and traffic volume (e.g., 1M DAU).
2
Estimate RPS
Calculate Requests Per Second needed.
3
Benchmark
Test how much 1 single machine can handle.
4
Autoscale
Configure machines to scale based on CPU/RAM load.

Example: PayTM Flash Sale

Scenario: 8:00 PM Flash Sale.

  • Normal Traffic: 1,000 req/s
  • Spike Traffic: 100,000 req/s
  • The Problem: Traffic spikes in seconds, but servers take ~5 minutes to boot.

[Chart: requests/sec vs server capacity. Traffic spikes far above server capacity: CRASH 💥]

The Solution: Autoscaling Logic

  1. Monitor: Watch metrics (CPU/RPS).
  2. Alarm: Threshold crossed (e.g. > 80% CPU).
  3. Decision: ASG Policy says "Add +3 Servers".
  4. Action: New VMs boot up and register with Load Balancer.
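The decision step above can be sketched as a pure function. The 80% threshold and the "+3 servers" policy come from the example; the scale-in branch is an assumed addition:

```javascript
// Minimal sketch of the ASG decision step (step 3 above).
// Threshold (80% CPU) and the "+3 servers" policy follow the example;
// the scale-in branch is an assumed extra.
function autoscaleTick(metrics, currentServers) {
  if (metrics.cpu > 0.8) return currentServers + 3;                      // alarm fired: add +3
  if (metrics.cpu < 0.2 && currentServers > 1) return currentServers - 1; // idle: scale in
  return currentServers;                                                 // within bounds: no change
}

console.log(autoscaleTick({ cpu: 0.9 }, 4)); // 7
```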
Autoscaling Decision Flow

[Diagram: monitoring agents on each AWS machine stream metrics (aggregate: 1,000 req/s) to the Decision Engine (ASG); if load > 80%, it answers "YES: Add +3 Servers"]

Critical: When RPS Fails

Metrics like req/s are great for simple apps (PayTM, Twitter). But they fail miserably for Heavy Compute Jobs.

The Problem: Use Case

Example: Fab.AI or YouTube Studio. A single request (e.g., "Generate 4K Video") might take 30 seconds.

Why RPS Fails

You might only have 0.1 req/s, but your server is at 100% CPU. If you scaled on RPS, you wouldn't scale at all.

The Solution: Better Metrics

  • CPU/GPU Utilization: Scale if utilization > 70%.
  • Queue Depth: If there are > 5 jobs pending, add a worker node.
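The two metrics above can be combined into a single scale-out check. A minimal sketch, using the thresholds from the bullets (70% utilization, 5 pending jobs); the function name is illustrative:

```javascript
// Sketch: add a worker when compute is saturated OR the backlog grows,
// using the thresholds from the bullets above.
function shouldAddWorker({ gpuUtilization, pendingJobs }) {
  return gpuUtilization > 0.7 || pendingJobs > 5;
}

console.log(shouldAddWorker({ gpuUtilization: 0.3, pendingJobs: 12 })); // true
```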
Estimation Formula (Back-of-Envelope)

Questions to ask:
1. What is the Read/Write ratio? (Reads are cheap, Writes are hard)
2. Average Request Size? (1KB vs 10MB)
3. Desired Latency? (Do we need to reply in 50ms?)

Capacity = (Traffic * Request_Weight) / Single_Server_Limit
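The paper math can be sketched as a small calculator. All the numbers here are illustrative assumptions (1M DAU, 20 requests/user/day, a 5x peak multiplier, and a 500 req/s single-server benchmark):

```javascript
// Back-of-envelope estimate: how many servers for 1M DAU?
// Every number below is an illustrative assumption.
function estimateServers({ dau, requestsPerUserPerDay, peakMultiplier, singleServerRps }) {
  const avgRps = (dau * requestsPerUserPerDay) / 86_400; // 86,400 seconds per day
  const peakRps = avgRps * peakMultiplier;               // size for spikes, not averages
  return { avgRps, peakRps, servers: Math.ceil(peakRps / singleServerRps) };
}

const est = estimateServers({
  dau: 1_000_000,            // Paper Math: 1M daily active users
  requestsPerUserPerDay: 20, // assumed traffic per user
  peakMultiplier: 5,         // assumed peak-to-average ratio
  singleServerRps: 500,      // from benchmarking one machine
});
console.log(est.servers); // 3
```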
07

Resource Intensive Architectures

The "Heavy Compute" Problem

Horizontal scaling works for simple requests (get user, save post). But if a request takes 30 seconds (e.g., video transcoding, executing user code), your server hits its connection limit instantly.
Autoscaling cannot save you here. It responds too slowly.

Architecture: Queue vs Direct
  1. User: sends the request.
  2. API Layer: accepts the job instantly ("Job Accepted, job #101") and returns a processing signal to the user.
  3. Message Queue (Redis/SQS): jobs (#101, #102, #103) wait in a Redis list (the buffer zone).
  4. Workers (W-1, W-2, ...): process jobs asynchronously, at their own pace.
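A minimal in-process sketch of this flow, with a plain array standing in for the Redis list (job ids and field names are illustrative):

```javascript
// Sketch: queue-based architecture; an array stands in for Redis/SQS.
const queue = [];
let nextId = 100;

// API layer: accept instantly, return a job id the user can poll
function submitJob(payload) {
  const job = { id: ++nextId, payload, status: 'pending' };
  queue.push(job); // buffered in the queue, not executed yet
  return { accepted: true, jobId: job.id };
}

// Worker: pulls one job per tick, at its own pace
function workerTick() {
  const job = queue.shift();
  if (!job) return null; // nothing pending
  job.status = 'done';   // the heavy work (transcoding etc.) would happen here
  return job;
}

console.log(submitJob('generate 4K video')); // { accepted: true, jobId: 101 }
console.log(workerTick().status);            // 'done'
```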

Why Autoscaling Fails

  1. Connection Bloat: If 1,000 users send 10-minute requests, you need 1,000 active HTTP connections. Browsers will time out.
  2. Wasted Resources: If you spin up a server for a 10s job, boot time (30s) > execution time.
  3. Loss of Control: If 1M users click "Generate", you crash. With a queue, you just have 1M items in Redis, and your workers run peacefully at their own pace.

Why Queues Win

  1. Decoupling: The API responds instantly ("Job Accepted"). The user can poll for status or get a webhook.
  2. Cost Efficiency: You can keep 10 workers for 1M jobs. It will just take longer, but it won't crash or cost $1M.
  3. Graceful Shutdown: Workers can finish their current video render before shutting down, unlike a web server that might be killed by a scale-down event.

What's Next?

We have optimized our single server to use all its CPU cores (Vertical Scaling). But eventually, one server is not enough. You will need 10, 50, or 100 servers.

How do you distribute traffic to them? How do you ensure users don't get sent to a dead server?

For that, we need a Load Balancer.

08

Review: Database Scaling

Stateless app servers are easy to scale. Databases (Stateful) are hard.

Problem: Data Consistency

If you just copy the database to 3 servers:
1. User A writes to DB Server 1.
2. User B reads from DB Server 2.
3. User B might not see User A's data yet! (Replication Lag)
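A toy in-memory simulation of this race. The Maps and the manual replicate step are stand-ins; real replication is asynchronous network traffic:

```javascript
// Toy model: primary and replica as two Maps; replication is a manual step.
const primary = new Map();
const replica = new Map();

const write = (key, value) => primary.set(key, value); // User A writes to Server 1
const readReplica = (key) => replica.get(key);         // User B reads from Server 2
const replicate = () => { for (const [k, v] of primary) replica.set(k, v); };

write('post:1', 'hello');
console.log(readReplica('post:1')); // undefined (replication lag!)
replicate();
console.log(readReplica('post:1')); // 'hello'
```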

1. Read Replicas (Master-Slave)

One Primary node handles all WRITES. Multiple Replica nodes handle READS.
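The read/write split typically lives in the application's data layer. A sketch (server names and the round-robin choice are assumptions):

```javascript
// Sketch: route WRITES to the primary, READS round-robin across replicas.
const primaryDb = { name: 'db-primary' };
const replicaDbs = [{ name: 'db-replica-1' }, { name: 'db-replica-2' }];
let next = 0;

function route(queryType) {
  if (queryType === 'write') return primaryDb;    // all writes hit one node
  return replicaDbs[next++ % replicaDbs.length];  // reads spread over replicas
}
```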

2. Sharding

Splitting data across multiple servers (e.g., Users A-M on Server 1, N-Z on Server 2).
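The A-M / N-Z split can be sketched as a shard-routing function (range-based, as in the example; hashing the key is the common alternative, and the server names are illustrative):

```javascript
// Sketch: range-based sharding on the first letter of the username.
function shardFor(username) {
  const first = username[0].toUpperCase();
  return first <= 'M' ? 'server-1' : 'server-2'; // A-M vs N-Z
}

console.log(shardFor('alice')); // 'server-1'
console.log(shardFor('zoe'));   // 'server-2'
```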