
Single Point of Failure

Mastering the art of resiliency: Identifying and eliminating the critical vulnerabilities that can bring down your entire system.

What is a SPOF?

"Single points of failure are those points where the entire system can crash in case that point crashes."

Classic Case: If you rely on a single database and it crashes, your entire system crashes.

Resiliency Rule: A system with a SPOF is not resilient; it's a house of cards.

Humanity & Earth

"Humanity as a service will die out if there's a big enough asteroid which hits Earth. Earth is our single point of failure."

The Story Analogy

In stories, you might be too attached to one character—you cannot afford to kill them. In systems, that dependency is a bug, not a feature.

The Resilient Pipeline

Visualizing the redundant path from User to Database.

  • DNS (Multi-IP Resolution): resolves multiple addresses.
  • Load Balancers (Redundant Cluster): handle failover automatically.
  • Server Nodes (Stateless Scaling): scale horizontally.
  • Database (Master-Slave Mirroring, Primary and Replica): data mirrored across instances.

High Availability Blueprint
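
The first hop of the pipeline can be sketched in a few lines: when one name resolves to several addresses, a client can skip the ones that do not answer. This is a minimal illustration, not a real resolver. The addresses, the `down` set, and the `first_reachable` helper are hypothetical; a real client would probe with an actual connection attempt.

```python
from typing import Callable, Iterable


def first_reachable(addresses: Iterable[str], probe: Callable[[str], bool]) -> str:
    """Return the first address for which probe() reports success."""
    for address in addresses:
        if probe(address):
            return address
    raise ConnectionError("no address is reachable")


# Hypothetical addresses, as if one DNS name had resolved to three records.
resolved = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
down = {"10.0.0.1"}  # pretend the first load balancer is offline

# The probe is a stand-in for a real connection attempt.
chosen = first_reachable(resolved, lambda addr: addr not in down)
print(chosen)
```

With only one address in `resolved`, the same outage would raise `ConnectionError`: that single address would be the SPOF.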

The Math of Probabilities

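The obvious way to mitigate failure is to add another node. If the probability of one node failing is P, adding a mirrored node means the system fails only when both nodes fail at once. Assuming the two fail independently, the system failure probability becomes:

System Risk = P × P = P²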

"Since P is usually very small (e.g., 0.01), P² becomes exponentially smaller (0.0001). This makes the database significantly more resilient than before."

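The arithmetic generalizes to any number of mirrored nodes: with n replicas the system is down only when all n are down, giving P to the power n. The sketch below assumes failures are independent, which is optimistic; nodes that share a rack, a power feed, or a software bug tend to fail together. The function name is illustrative.

```python
def system_failure_probability(p: float, replicas: int) -> float:
    """Probability that every replica is down at once, assuming independence."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return p ** replicas


# P = 0.01 is the example value used in the text.
for n in (1, 2, 3):
    print(n, system_failure_probability(0.01, n))
```

Each extra replica multiplies the risk by P again, which is why the improvement per added node is so large when P is small.
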
Multiple Nodes

Instead of one profile server, add a second one. More nodes = More money = Less failure.

Mirrored Backup

Backup services don't make sense if data isn't mirrored. Data must stay in sync.
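
Why mirroring matters can be shown with a toy primary/replica pair. This sketch assumes synchronous mirroring (every write is copied before it is acknowledged) and uses in-memory dictionaries in place of real databases; the `MirroredStore` class is hypothetical.

```python
class MirroredStore:
    """Toy primary/replica pair with synchronous mirroring."""

    def __init__(self) -> None:
        self.primary = {}
        self.replica = {}
        self.primary_up = True

    def write(self, key, value) -> None:
        if not self.primary_up:
            # Promotion of the replica is out of scope for this sketch.
            raise RuntimeError("primary is down; writes are rejected")
        self.primary[key] = value
        self.replica[key] = value  # mirror before acknowledging the write

    def read(self, key):
        # Reads fall back to the replica when the primary is unavailable.
        source = self.primary if self.primary_up else self.replica
        return source[key]


store = MirroredStore()
store.write("user:1", "Ada")
store.primary_up = False      # simulate a primary crash
print(store.read("user:1"))   # still answered, from the replica
```

If `write` skipped the mirroring line, the read after the crash would raise `KeyError`: a backup that is out of sync is no backup.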

Beyond Single Nodes: Multi-Region Strategy

Even with multiple servers and redundant databases, a geographic disaster can take down a whole data center. True resilience requires global distribution.

  • Store data in US-East, US-West, and Europe.
  • Use Global Server Load Balancing (GSLB) to route traffic to a healthy region.
  • Achieve resilience against localized catastrophes.
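
The routing decision in the list above can be sketched as "send the user to the closest region that is still healthy". The region names follow the list; the latency figures and the `pick_region` function are made up for illustration and do not reflect how any particular GSLB product works.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies as seen by one user.
latency = {"us-east": 20.0, "us-west": 70.0, "europe": 110.0}

print(pick_region(latency, {"us-east", "us-west", "europe"}))  # nearest region
print(pick_region(latency, {"us-west", "europe"}))             # us-east is offline
```

Losing the nearest region costs latency, not availability, as long as the data is also present in the region that takes over.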

Resiliency in Production

"Netflix does this really well. They have something called Chaos Monkey which just randomly goes on production and takes down one node."

Chaos Monkey Policy

Tests whether the system is really as distributed as you claim.
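
The idea behind this kind of test can be imitated on a toy cluster: take down one node at random and check that requests are still served. This is a simulation written for these notes, not Netflix's tool or its API; every name in it is hypothetical.

```python
import random
from typing import Dict, List


def serve(cluster: Dict[str, bool]) -> str:
    """Return the name of a node that can handle a request."""
    for name, up in cluster.items():
        if up:
            return name
    raise RuntimeError("outage: no node is up")


def kill_random_node(cluster: Dict[str, bool], rng: random.Random) -> str:
    """Mark one running node as down and return its name."""
    victim = rng.choice([name for name, up in cluster.items() if up])
    cluster[victim] = False
    return victim


def survives_one_failure(node_names: List[str], seed: int = 0) -> bool:
    """True if the cluster still serves traffic after losing one random node."""
    cluster = {name: True for name in node_names}
    kill_random_node(cluster, random.Random(seed))
    try:
        serve(cluster)
        return True
    except RuntimeError:
        return False


print(survives_one_failure(["node-a", "node-b", "node-c"]))
print(survives_one_failure(["only-node"]))
```

A cluster with a single node fails the check no matter which node is picked, which is exactly the condition the test is meant to expose.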

Advanced Resiliency Patterns

Circuit Breaker

If one service starts failing, stop sending requests to it. This prevents a "Cascading Failure" where the rest of your system waits for a dead component and dies itself.

  • Automate health checks
  • Fast fail instead of waiting
  • Self-healing fallback logic

Active strategies

Active-Passive

One node works, one node waits. Simple, but the standby's capacity sits idle.

Active-Active (Gold Standard)

All nodes work together. If any one fails, the others pick up the load immediately. No switching time required.
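
The difference between the two strategies shows up in how a request is routed. In this sketch, which assumes health information is already available as a set of node names, active-passive always uses the highest-priority healthy node, while active-active spreads requests over every healthy node. The function names and the modulo-based spreading are illustrative choices.

```python
from typing import List, Set


def route_active_passive(nodes: List[str], healthy: Set[str]) -> str:
    """Send all traffic to the first healthy node in priority order."""
    for node in nodes:
        if node in healthy:
            return node
    raise RuntimeError("no healthy node")


def route_active_active(nodes: List[str], healthy: Set[str], request_id: int) -> str:
    """Spread requests across every healthy node."""
    live = [node for node in nodes if node in healthy]
    if not live:
        raise RuntimeError("no healthy node")
    return live[request_id % len(live)]


nodes = ["node-a", "node-b"]

# Active-passive: node-b is idle until node-a fails.
print(route_active_passive(nodes, {"node-a", "node-b"}))
print(route_active_passive(nodes, {"node-b"}))

# Active-active: both nodes take traffic; survivors absorb the load on failure.
print([route_active_active(nodes, {"node-a", "node-b"}, i) for i in range(4)])
print([route_active_active(nodes, {"node-b"}, i) for i in range(4)])
```

In the active-active case the surviving nodes must have enough spare capacity to carry the failed node's share, which is part of what the extra cost buys.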

Summary

Your architecture is only as strong as its weakest point. Resiliency is built through redundancy, mirroring, and continuous testing.

The Cost: Redundancy requires investment.
The Reward: Exponentially higher reliability.

Eliminate every single point of failure.