Design WhatsApp
Like you should in an interview. Explained as simply as possible… but not simpler.
In this issue, I walk through, out loud and step by step, the exact thinking I'd use in a system design interview. It's clear, practical, and includes trade-offs you can defend.
What you’ll learn in ~15 minutes
How I would scope the problem without missing important requirements
Why Layer 4 load balancers are critical for WebSocket scaling
The inbox table pattern for guaranteed message delivery (how to handle offline users without complex queuing systems)
When Redis Pub/Sub outperforms Kafka for messaging
How this issue is structured
I split the write-up into the same sections I’d narrate at a whiteboard. Free readers get the full walkthrough up to the deep-dive parts. Paid members get the 🔒 sections.
Initial Thoughts & Clarifying Questions
Functional Requirements
Non-Functional Requirements
Back-of-the-envelope Estimations (QPS, storage, bandwidth, cardinality math)
🔒 System Design (the architecture I’d draw and the Excalidraw link for it!)
🔒 Component Breakdown (why each piece exists + alternatives)
🔒 Trade-offs Made
🔒 Security & Privacy
🔒 Monitoring, Logging, and Alerting
🔒 Extensions & Advanced Considerations
Quick note: If you’ve been getting value from these and want the full deep dives, becoming a paid member helps me keep writing—and you’ll immediately unlock the 🔒 sections above, plus a few extras I lean on when I practice.
Members also get
12 Back-of-the-Envelope Calculations Every Engineer Should Know
My Excalidraw System Design Template — drop-in canvas you can copy and tweak.
My System Design Component Library
Let’s get to it!
Initial Thoughts & Clarifying Questions
To begin, I'd want to understand the specific scope and constraints of this messaging system. Let me ask some clarifying questions to nail down the requirements:
Q: Are we focusing on one-to-one messaging, group chats, or both? I'm assuming we need to support both, with group chats being a generalization where a "group of two" represents one-to-one messaging. This simplifies our data model.
Q: What types of media should we support beyond text messages? I'll assume we need to handle images, videos, and file attachments - essentially any media type that users commonly share. This will significantly impact our storage and bandwidth considerations.
Q: Should we support real-time features like typing indicators, read receipts, or online presence? For the core design, I'm focusing on message delivery, but I'll mention how we could extend this for presence indicators later since it's a common follow-up question.
Q: What's our expected scale in terms of users and messages? Given that WhatsApp serves billions of users, I'm targeting a system that can handle billions of users sending on the order of a hundred billion messages daily.
Q: Do we need to support multiple devices per user? Initially, I'll design for one device per user to keep things simple, then show how to extend for multiple devices since that's realistic for modern messaging apps.
Q: What are our consistency requirements - do messages need to be delivered in order? I'm assuming eventual consistency is acceptable, but messages within a chat should maintain order.
Q: Any specific privacy or data retention requirements? Following WhatsApp's model, I'll design for minimal data retention - messages should be deleted from servers once delivered to maintain privacy.
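To make the "group of two" idea and the per-chat ordering assumption concrete, here is a minimal sketch. The class names, fields, and the per-chat sequence counter are my own illustrative assumptions, not WhatsApp's actual data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Chat:
    """A chat is just a set of participants; one-to-one is a group of two."""
    chat_id: str
    participants: FrozenSet[str]
    next_seq: int = 1  # hypothetical per-chat counter used to order messages

    @property
    def is_direct(self) -> bool:
        return len(self.participants) == 2


@dataclass(frozen=True)
class Message:
    chat_id: str
    sender: str
    seq: int  # ordering is only meaningful within a single chat
    body: str


def send(chat: Chat, sender: str, body: str) -> Message:
    """Stamp a message with the chat's next sequence number."""
    if sender not in chat.participants:
        raise ValueError(f"{sender} is not a member of chat {chat.chat_id}")
    message = Message(chat.chat_id, sender, chat.next_seq, body)
    chat.next_seq += 1
    return message


def in_chat_order(messages: List[Message]) -> List[Message]:
    """Clients can restore per-chat order even if delivery is out of order."""
    return sorted(messages, key=lambda m: m.seq)


direct = Chat("c1", frozenset({"alice", "bob"}))
group = Chat("c2", frozenset({"alice", "bob", "carol"}))
first = send(direct, "alice", "hi")
second = send(direct, "bob", "hey")
restored = in_chat_order([second, first])
```

Because both chat types share one shape, every later code path (fan-out, storage, delivery) only has to handle "send to all participants except the sender".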
Functional Requirements
From what I understand, the core requirements are:
Create and manage group chats - Users can start conversations with individuals or groups
Send and receive text messages - Real-time bidirectional messaging between participants
Send and receive media attachments - Support for images, videos, and files
Access messages after being offline - Message persistence and delivery to disconnected clients
Message delivery guarantees - Ensure messages reach their intended recipients
These requirements form the foundation of our system. Each one builds on the previous, creating natural stepping stones for our design evolution.
Non-Functional Requirements
I'd expect this system to handle several critical performance and operational characteristics:
Latency: We'd likely need sub-500ms message delivery for users who are online. This feels natural - anything slower would make conversations feel sluggish, but the difference between 100ms and 500ms isn't particularly noticeable to users.
Reliability: Message delivery must be guaranteed. It would be unacceptable for messages to simply disappear.
Scale: Supporting billions of users, each sending dozens of messages per day. The math here gets interesting when we calculate the actual throughput requirements.
Privacy: Following WhatsApp's philosophy, we should minimize data retention. Messages should be deleted from servers after successful delivery.
Availability: This is a global system that needs to tolerate individual component failures without taking down the entire service. We can't have single points of failure.
Back-of-the-envelope Estimations
Let's say we have 2 billion daily active users. If each user sends 50 messages per day on average, that gives us 100 billion messages daily.
QPS Calculations:
Daily messages: 2B users × 50 messages = 100B messages/day
Messages per second: 100B / (24 × 3600) ≈ 1.16M messages/sec
Read QPS: Assuming each message is delivered to 2.5 recipients on average: 1.16M × 2.5 ≈ 2.9M reads/sec
Peak load (3x average): 3.5M writes/sec, 8.7M reads/sec
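The QPS arithmetic above can be reproduced with a short script, which makes it easy to change an assumption mid-interview and see what moves. The function and parameter names are my own; the inputs are the figures from the estimates above.

```python
SECONDS_PER_DAY = 24 * 3600


def qps_estimate(daily_users, msgs_per_user, fanout, peak_factor):
    """Return average and peak write/read rates in messages per second."""
    daily_messages = daily_users * msgs_per_user
    writes_per_sec = daily_messages / SECONDS_PER_DAY
    reads_per_sec = writes_per_sec * fanout  # each message reaches `fanout` recipients
    return {
        "daily_messages": daily_messages,
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
        "peak_writes_per_sec": writes_per_sec * peak_factor,
        "peak_reads_per_sec": reads_per_sec * peak_factor,
    }


qps = qps_estimate(
    daily_users=2_000_000_000,
    msgs_per_user=50,
    fanout=2.5,
    peak_factor=3,
)
for name, value in qps.items():
    print(f"{name}: {value:,.0f}")
```

The exact outputs land slightly below the rounded figures in the list (for example, about 3.47M peak writes/sec rather than 3.5M), which is fine at back-of-the-envelope precision.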
Storage Requirements:
Average message size: 1KB (including metadata)
Daily storage: 100B × 1KB = 100TB/day
With 30-day retention as a worst case: 100TB × 30 = 3PB total
But most messages are delivered immediately and then deleted, so the steady state is probably 10-20% of this: 300-600TB
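Here is the same storage estimate as a small sketch, using decimal units (1KB = 1,000 bytes) to match the round numbers above. The names are illustrative, and the 10-20% undelivered fraction is the guess from the text, not a measured value.

```python
KB = 1_000
TB = 1_000 ** 4
PB = 1_000 ** 5


def storage_estimate(daily_messages, avg_message_bytes, retention_days,
                     undelivered_fraction):
    """Return (daily, worst-case, steady-state) storage in bytes."""
    daily_bytes = daily_messages * avg_message_bytes
    worst_case_bytes = daily_bytes * retention_days
    # Only messages still waiting for an offline recipient stay on the server.
    steady_state_bytes = worst_case_bytes * undelivered_fraction
    return daily_bytes, worst_case_bytes, steady_state_bytes


daily, worst_case, steady_low = storage_estimate(100_000_000_000, 1 * KB, 30, 0.10)
_, _, steady_high = storage_estimate(100_000_000_000, 1 * KB, 30, 0.20)

print(f"daily: {daily / TB:.0f} TB")
print(f"worst case: {worst_case / PB:.0f} PB")
print(f"steady state: {steady_low / TB:.0f}-{steady_high / TB:.0f} TB")
```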
Bandwidth Estimates:
Message traffic: 1.16M/sec × 1KB = 1.16 GB/sec
Media uploads (10% of messages, 100KB average): 116K/sec × 100KB = 11.6 GB/sec
Total ingress: ~13 GB/sec
With replication and delivery overhead: ~40 GB/sec total bandwidth
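The bandwidth numbers follow the same pattern. In this sketch the 3x overhead multiplier is my own assumption, chosen because it roughly reproduces the ~40 GB/sec figure; the text does not state the exact factor.

```python
SECONDS_PER_DAY = 24 * 3600
KB = 1_000
GB = 1_000 ** 3


def bandwidth_estimate(daily_messages, text_bytes, media_fraction, media_bytes,
                       overhead_factor):
    """Return (text, media, ingress, total) bandwidth in bytes per second."""
    msgs_per_sec = daily_messages / SECONDS_PER_DAY
    text_bps = msgs_per_sec * text_bytes
    media_bps = msgs_per_sec * media_fraction * media_bytes
    ingress_bps = text_bps + media_bps
    # Assumed multiplier covering replication plus delivery to recipients.
    total_bps = ingress_bps * overhead_factor
    return text_bps, media_bps, ingress_bps, total_bps


text_bps, media_bps, ingress_bps, total_bps = bandwidth_estimate(
    daily_messages=100_000_000_000,
    text_bytes=1 * KB,
    media_fraction=0.10,
    media_bytes=100 * KB,
    overhead_factor=3,  # hypothetical; picked to land near ~40 GB/sec
)
print(f"text: {text_bps / GB:.2f} GB/s, media: {media_bps / GB:.2f} GB/s")
print(f"ingress: {ingress_bps / GB:.1f} GB/s, total: {total_bps / GB:.1f} GB/s")
```

Media dominates: at 10% of messages it still carries about ten times the bytes of all text traffic, which is a good argument for routing attachments through separate blob storage rather than the chat path.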
Memory Requirements:
Active connections: Assuming 20% of users online simultaneously = 400M connections
Connection overhead: ~10KB per connection = 4TB memory for connection state
Message caching: Hot data cache holding ~1 hour of messages (100TB/day ÷ 24) ≈ 4TB
Total memory across all servers: ~8TB
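The memory figures can be checked the same way. This sketch assumes the hot cache holds one hour of message payloads at 1KB each, which is my reading of the ~4TB figure; the names are illustrative.

```python
KB = 1_000
TB = 1_000 ** 4


def memory_estimate(daily_users, online_fraction, bytes_per_connection,
                    daily_messages, avg_message_bytes, cache_hours):
    """Return (connections, connection memory, cache memory, total) in bytes."""
    connections = daily_users * online_fraction
    connection_bytes = connections * bytes_per_connection
    # Assumed: the cache holds `cache_hours` worth of average message traffic.
    cache_bytes = daily_messages * avg_message_bytes * cache_hours / 24
    return connections, connection_bytes, cache_bytes, connection_bytes + cache_bytes


connections, connection_bytes, cache_bytes, total_bytes = memory_estimate(
    daily_users=2_000_000_000,
    online_fraction=0.20,
    bytes_per_connection=10 * KB,
    daily_messages=100_000_000_000,
    avg_message_bytes=1 * KB,
    cache_hours=1,
)
print(f"connections: {connections:,.0f}")
print(f"connection state: {connection_bytes / TB:.1f} TB")
print(f"hot cache: {cache_bytes / TB:.1f} TB")
print(f"total: {total_bytes / TB:.1f} TB")
```

Dividing the total by a plausible per-server memory budget gives a rough floor for the fleet size of connection servers.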
These numbers tell me we're definitely in distributed systems territory - no single machine can handle this load.
🔒 System Design
Here’s the system design I am thinking of: