System Design, but Simple

Design WhatsApp

Like you should in an interview. Explained as simply as possible… but not simpler.

Stephane Moreau
Sep 22, 2025

In this issue, I walk through, step by step, the exact reasoning I’d say out loud in a system design interview: clear, practical, and with trade-offs you can defend.

What you’ll learn in ~15 minutes

  • How I would scope the problem without missing important requirements

  • Why Layer 4 load balancers are critical for WebSocket scaling

  • The inbox table pattern for guaranteed message delivery (how to handle offline users without complex queuing systems)

  • When Redis Pub/Sub outperforms Kafka for messaging

How this issue is structured
I split the write-up into the same sections I’d narrate at a whiteboard. Free readers get the full walkthrough up to the deep-dive parts. Paid members get the 🔒 sections.

  • Initial Thoughts & Clarifying Questions

  • Functional Requirements

  • Non-Functional Requirements

  • Back-of-the-envelope Estimations (QPS, storage, bandwidth, cardinality math)

  • 🔒 System Design (the architecture I’d draw and the Excalidraw link for it!)

  • 🔒 Component Breakdown (why each piece exists + alternatives)

  • 🔒 Trade-offs Made

  • 🔒 Security & Privacy

  • 🔒 Monitoring, Logging, and Alerting

  • 🔒 Extensions & Advanced Considerations

Quick note: If you’ve been getting value from these and want the full deep dives, becoming a paid member helps me keep writing—and you’ll immediately unlock the 🔒 sections above, plus a few extras I lean on when I practice.

Members also get

  • 12 Back-of-the-Envelope Calculations Every Engineer Should Know

  • My Excalidraw System Design Template — drop-in canvas you can copy and tweak.

  • My System Design Component Library

Let’s get to it!


Initial Thoughts & Clarifying Questions

To begin, I'd want to understand the specific scope and constraints of this messaging system. Let me ask some clarifying questions to nail down the requirements:

Q: Are we focusing on one-to-one messaging, group chats, or both? I'm assuming we need to support both, with group chats being a generalization where a "group of two" represents one-to-one messaging. This simplifies our data model.

Q: What types of media should we support beyond text messages? I'll assume we need to handle images, videos, and file attachments - essentially any media type that users commonly share. This will significantly impact our storage and bandwidth considerations.

Q: Should we support real-time features like typing indicators, read receipts, or online presence? For the core design, I'm focusing on message delivery, but I'll mention how we could extend this for presence indicators later since it's a common follow-up question.

Q: What's our expected scale in terms of users and messages? Given that WhatsApp serves billions of users, I'm targeting a system that can handle billions of users sending on the order of 100 billion messages daily.

Q: Do we need to support multiple devices per user? Initially, I'll design for one device per user to keep things simple, then show how to extend for multiple devices since that's realistic for modern messaging apps.

Q: What are our consistency requirements - do messages need to be delivered in order? I'm assuming eventual consistency is acceptable, but messages within a chat should maintain order.

Q: Any specific privacy or data retention requirements? Following WhatsApp's model, I'll design for minimal data retention - messages should be deleted from servers once delivered to maintain privacy.
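
To make the "group of two" assumption concrete, here is a minimal Python sketch of a chat model in which a one-to-one conversation is simply a chat with two participants. The names (`Chat`, `recipients_for`, `assign_seq`) and the per-chat sequence counter used for ordering are illustrative assumptions, not something the design above prescribes.

```python
from dataclasses import dataclass


@dataclass
class Chat:
    """A chat is a set of participants; a one-to-one chat just has two."""

    chat_id: str
    participant_ids: frozenset
    next_seq: int = 0  # hypothetical per-chat counter to keep messages ordered

    @property
    def is_direct(self) -> bool:
        # A "group of two" is what the UI would show as a one-to-one chat.
        return len(self.participant_ids) == 2

    def recipients_for(self, sender_id: str) -> frozenset:
        """Everyone in the chat except the sender."""
        if sender_id not in self.participant_ids:
            raise ValueError(f"{sender_id} is not a member of chat {self.chat_id}")
        return self.participant_ids - {sender_id}

    def assign_seq(self) -> int:
        """Hand out the next sequence number for a message in this chat."""
        seq = self.next_seq
        self.next_seq += 1
        return seq


direct = Chat("c1", frozenset({"alice", "bob"}))
group = Chat("c2", frozenset({"alice", "bob", "carol"}))

print(direct.is_direct, sorted(direct.recipients_for("alice")))
print(group.is_direct, sorted(group.recipients_for("alice")))
```

With this shape, the send path never needs a separate code branch for one-to-one messaging: it always fans out to `recipients_for(sender)`.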

Functional Requirements

From what I understand, the core requirements are:

  1. Create and manage group chats - Users can start conversations with individuals or groups

  2. Send and receive text messages - Real-time bidirectional messaging between participants

  3. Send and receive media attachments - Support for images, videos, and files

  4. Access messages after being offline - Message persistence and delivery to disconnected clients

  5. Message delivery guarantees - Ensure messages reach their intended recipients

These requirements form the foundation of our system. Each one builds on the previous, creating natural stepping stones for our design evolution.

Non-Functional Requirements

I'd expect this system to handle several critical performance and operational characteristics:

Latency: We'd likely need sub-500ms message delivery for users who are online. This feels natural - anything slower would make conversations feel sluggish, but the difference between 100ms and 500ms isn't particularly noticeable to users.

Reliability: Message delivery must be guaranteed. It would be unacceptable for messages to simply disappear.

Scale: Supporting billions of users, each sending dozens of messages per day, which adds up to roughly 100 billion messages daily. The math here gets interesting when we calculate the actual throughput requirements.

Privacy: Following WhatsApp's philosophy, we should minimize data retention. Messages should be deleted from servers after successful delivery.

Availability: This is a global system that needs to tolerate individual component failures without taking down the entire service. We can't have single points of failure.

Back-of-the-envelope Estimations

Let's say we have 2 billion daily active users. If each user sends 50 messages per day on average, that gives us 100 billion messages daily.

QPS Calculations:

  • Daily messages: 2B users × 50 messages = 100B messages/day

  • Messages per second: 100B / (24 × 3600) ≈ 1.16M messages/sec

  • Read QPS: Assuming each message is delivered to 2.5 recipients on average: 1.16M × 2.5 ≈ 2.9M reads/sec

  • Peak load (3x average): 3.5M writes/sec, 8.7M reads/sec
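
The QPS arithmetic above is easy to sanity-check in a few lines of Python. This is only a sketch of the calculation using the inputs stated in this section; the variable names are mine.

```python
SECONDS_PER_DAY = 24 * 3600

# Inputs taken from the estimates above.
daily_active_users = 2_000_000_000
messages_per_user_per_day = 50
avg_recipients_per_message = 2.5
peak_factor = 3

daily_messages = daily_active_users * messages_per_user_per_day
writes_per_sec = daily_messages / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * avg_recipients_per_message
peak_writes_per_sec = writes_per_sec * peak_factor
peak_reads_per_sec = reads_per_sec * peak_factor

print(f"daily messages: {daily_messages / 1e9:.0f}B")
print(f"average: {writes_per_sec / 1e6:.2f}M writes/sec, {reads_per_sec / 1e6:.2f}M reads/sec")
print(f"peak:    {peak_writes_per_sec / 1e6:.2f}M writes/sec, {peak_reads_per_sec / 1e6:.2f}M reads/sec")
```

Keeping the inputs as named variables makes it quick to re-run the numbers if the interviewer changes an assumption, for example the peak factor.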

Storage Requirements:

  • Average message size: 1KB (including metadata)

  • Daily storage: 100B × 1KB = 100TB/day

  • With 30-day retention: 100TB × 30 = 3PB total

  • But most messages are delivered immediately and deleted, so steady state is probably 10-20% of this = 300-600TB
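
The same storage figures, written out as a short Python sketch. It uses decimal units (1KB = 1,000 bytes) to match the round numbers above; the variable names and the 10-20% range expressed as fractions are simply a restatement of the bullets.

```python
KB = 1_000
TB = 1_000**4
PB = 1_000**5

# Inputs taken from the estimates above.
daily_messages = 100_000_000_000
avg_message_bytes = 1 * KB
retention_days = 30
undelivered_fraction_low, undelivered_fraction_high = 0.10, 0.20

daily_bytes = daily_messages * avg_message_bytes
retained_bytes = daily_bytes * retention_days
steady_state_low = retained_bytes * undelivered_fraction_low
steady_state_high = retained_bytes * undelivered_fraction_high

print(f"daily storage:    {daily_bytes / TB:.0f} TB")
print(f"30-day retention: {retained_bytes / PB:.0f} PB")
print(f"steady state:     {steady_state_low / TB:.0f}-{steady_state_high / TB:.0f} TB")
```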

Bandwidth Estimates:

  • Message traffic: 1.16M/sec × 1KB = 1.16 GB/sec

  • Media uploads (10% of messages, 100KB average): 116K/sec × 100KB = 11.6 GB/sec

  • Total ingress: ~13 GB/sec

  • With replication and delivery overhead: ~40 GB/sec total bandwidth
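
A quick Python check of the bandwidth numbers follows. It starts from the rounded 1.16M messages/sec figure and, rather than assuming an overhead factor, derives the multiplier that the ~40 GB/sec total implies. Variable names are illustrative.

```python
KB = 1_000
GB = 1_000**3

# Inputs taken from the estimates above.
messages_per_sec = 1_160_000
text_message_bytes = 1 * KB
media_message_bytes = 100 * KB
stated_total_bytes_per_sec = 40 * GB

media_messages_per_sec = messages_per_sec // 10  # 10% of messages carry media

text_bytes_per_sec = messages_per_sec * text_message_bytes
media_bytes_per_sec = media_messages_per_sec * media_message_bytes
ingress_bytes_per_sec = text_bytes_per_sec + media_bytes_per_sec

# Multiplier implied by the ~40 GB/sec total (replication plus delivery).
implied_overhead_factor = stated_total_bytes_per_sec / ingress_bytes_per_sec

print(f"text:    {text_bytes_per_sec / GB:.2f} GB/sec")
print(f"media:   {media_bytes_per_sec / GB:.1f} GB/sec")
print(f"ingress: {ingress_bytes_per_sec / GB:.2f} GB/sec")
print(f"implied overhead factor: {implied_overhead_factor:.1f}x")
```

Media dominates: even at 10% of messages it accounts for roughly ten times the bytes of all text traffic, which is why attachments usually get their own upload path.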

Memory Requirements:

  • Active connections: Assuming 20% of users online simultaneously = 400M connections

  • Connection overhead: ~10KB per connection = 4TB memory for connection state

  • Message caching: A hot cache holding ~1 hour of messages (100TB/day ÷ 24) ≈ 4TB

  • Total memory across all servers: ~8TB

These numbers tell me we're definitely in distributed systems territory - no single machine can handle this load.
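
The memory estimate can be sketched the same way in Python. The inputs come from the bullets above, with one exception: `assumed_connections_per_server` is a hypothetical figure I picked purely to show how a server count would fall out of the connection total.

```python
import math

KB = 1_000
TB = 1_000**4

# Inputs taken from the estimates above.
daily_active_users = 2_000_000_000
online_percent = 20
bytes_per_connection = 10 * KB
daily_message_bytes = 100 * TB
hot_cache_hours = 1

# Hypothetical capacity per connection server (an assumption, not from the article).
assumed_connections_per_server = 1_000_000

concurrent_connections = daily_active_users * online_percent // 100
connection_state_bytes = concurrent_connections * bytes_per_connection
hot_cache_bytes = daily_message_bytes * hot_cache_hours / 24
total_memory_bytes = connection_state_bytes + hot_cache_bytes

servers_needed = math.ceil(concurrent_connections / assumed_connections_per_server)

print(f"concurrent connections: {concurrent_connections / 1e6:.0f}M")
print(f"connection state: {connection_state_bytes / TB:.0f} TB")
print(f"hot cache:        {hot_cache_bytes / TB:.1f} TB")
print(f"total memory:     {total_memory_bytes / TB:.1f} TB")
print(f"connection servers at the assumed capacity: {servers_needed}")
```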

🔒 System Design

Here’s the system design I am thinking of:

This post is for paid subscribers

© 2025 Stephane Moreau