Design FB Live Comments
Like you should in an interview. Explained as simply as possible… but not simpler.
In this issue, I walk through the exact thinking I’d narrate out loud in a system design interview, step by step: clear, practical, and with trade-offs you can defend.
What you’ll learn in ~15 minutes
How I would scope the problem without missing important requirements (real-time delivery, comment history and pagination, availability).
Why Server-Sent Events are better than WebSockets for one-way real-time updates, and how to implement SSE at Facebook scale.
The pub-sub architecture pattern that enables millions of concurrent viewers without overwhelming your database.
How cursor-based pagination outperforms offset pagination for infinite scroll in high-volume comment streams.
How this issue is structured
I split the write-up into the same sections I’d narrate at a whiteboard. Free readers get the full walkthrough up to the deep-dive parts. Paid members get the 🔒 sections.
Initial Thoughts & Clarifying Questions
Functional Requirements
Non-Functional Requirements
Back-of-the-envelope Estimations
🔒 System Design (the architecture I’d draw and the excalidraw link for it!)
🔒 Component Breakdown (why each piece exists + alternatives)
🔒 Trade-offs Made
🔒 Security & Privacy
🔒 Monitoring, Logging, and Alerting
🔒 Final Thoughts
Quick note: If you’ve been getting value from these and want the full deep dives, becoming a paid member helps me keep writing—and you’ll immediately unlock the 🔒 sections above, plus a few extras I lean on when I practice.
Members also get
12 Back-of-the-Envelope Calculations Every Engineer Should Know
My Excalidraw System Design Template — drop-in canvas you can copy and tweak.
My System Design Component Library
Let’s get to it!
Initial Thoughts & Clarifying Questions
To begin, I’d want to understand the scope and constraints better before jumping into the design. Here are the key questions I’d ask the interviewer:
“What’s the expected scale we’re designing for?” - I’m assuming millions of concurrent viewers spread across thousands of live videos, with popular streams seeing hundreds of comments per second. This tells me we’re designing for massive scale.
“How real-time do the comments need to be?” - I assume sub-200ms latency, which means we’re talking about truly real-time communication, not just “fast enough”.
“Do we need to handle comment moderation or spam filtering?” - I’ll assume this is out of scope for now, but it’s good to acknowledge that production systems would need robust content filtering.
“What about comment persistence - how long should we store comments?” - I’m assuming we need to store comments indefinitely for historical viewing, which impacts our storage design significantly.
“Do we need to support features like reactions, replies, or threading?” - I’ll assume these are out of scope to keep the core problem manageable, though I’d mention them as potential extensions.
“What platforms need to be supported?” - I’m assuming web and mobile clients, both of which can handle persistent connections reasonably well.
Functional Requirements
From what I understand, the core requirements are prioritized as follows:
Primary Requirements:
Users can post comments on live video feeds with immediate persistence
Viewers see new comments in real-time as they’re posted by others
Users can view historical comments when joining a live stream
Comments must be displayed in chronological order with proper pagination
Secondary Requirements:
Users can scroll through comment history (infinite scroll pattern; sketched just below)
The system gracefully handles user connections and disconnections
Comments remain associated with the correct live video stream
We’re building both a real-time messaging system and a persistent chat history system. The real-time aspect is what makes this challenging at scale.
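Since pagination shows up in both lists above, here’s a minimal sketch of how a cursor-based history fetch might work. Everything in it (the Comment shape, the base64 cursor encoding, fetchCommentPage) is an illustrative assumption on my part, not a finished API:

```typescript
// A minimal cursor-based pagination sketch (in-memory for clarity).
// In production this would be an indexed query, e.g.
//   WHERE video_id = ? AND (created_at, comment_id) < (?, ?)
//   ORDER BY created_at DESC, comment_id DESC LIMIT ?
interface Comment {
  commentId: string;
  videoId: string;
  authorId: string;
  text: string;
  createdAt: number; // epoch millis
}

interface CommentPage {
  comments: Comment[];
  nextCursor: string | null; // opaque token the client echoes back
}

// The cursor encodes the sort key of the last comment returned, so the
// next page starts exactly where the previous one ended.
function encodeCursor(c: Comment): string {
  return Buffer.from(`${c.createdAt}:${c.commentId}`).toString("base64");
}

function decodeCursor(cursor: string): { createdAt: number; commentId: string } {
  const [createdAt, commentId] = Buffer.from(cursor, "base64").toString().split(":");
  return { createdAt: Number(createdAt), commentId };
}

function fetchCommentPage(
  store: Comment[],
  videoId: string,
  cursor: string | null,
  limit = 50
): CommentPage {
  const after = cursor ? decodeCursor(cursor) : null;
  const page = store
    .filter((c) => c.videoId === videoId)
    // Newest first; tie-break on commentId for a total order.
    .sort(
      (a, b) =>
        b.createdAt - a.createdAt ||
        (a.commentId < b.commentId ? 1 : a.commentId > b.commentId ? -1 : 0)
    )
    // Keep only comments strictly older than the cursor position.
    .filter(
      (c) =>
        after === null ||
        c.createdAt < after.createdAt ||
        (c.createdAt === after.createdAt && c.commentId < after.commentId)
    )
    .slice(0, limit);

  const last = page[page.length - 1];
  return { comments: page, nextCursor: last ? encodeCursor(last) : null };
}
```

The payoff over offset pagination: new comments constantly arrive at the head of the stream, so an offset like skip=50 shifts under the reader’s feet and produces duplicates or gaps, while a cursor pinned to the last-seen sort key never does.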
Non-Functional Requirements
I’d expect this system to handle some serious performance and scale requirements:
Scale Expectations:
Support thousands of concurrent live videos simultaneously
Handle hundreds of comments per second on a popular live video
Accommodate millions of concurrent viewers across all streams
Performance Requirements:
Sub-200ms end-to-end latency for comment delivery
High availability with minimal downtime (99.9%+ uptime)
Eventual consistency is acceptable - comments don’t need to appear in exactly the same order for all users immediately
User Experience:
Seamless real-time experience that feels instantaneous
Reliable comment delivery without missing messages
Smooth infinite scrolling for historical comments
Graceful degradation when network conditions are poor
The 200ms latency requirement is particularly important because that’s the threshold where humans perceive interactions as real-time versus delayed.
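To ground the real-time side before the deep dive, here’s a minimal sketch of what a connection server’s SSE endpoint might look like in Node/TypeScript. The route shape, the comment event name, and the broadcastComment helper are my illustrative assumptions; the real fan-out design lives in the 🔒 sections:

```typescript
// Minimal SSE endpoint sketch using Node's built-in http module.
// One connection server process; broadcastComment would be fed by
// whatever pub-sub subscription covers this server's viewers.
import http from "node:http";

// videoId -> the open responses (viewers) subscribed to that stream
const subscribers = new Map<string, Set<http.ServerResponse>>();

const server = http.createServer((req, res) => {
  const match = req.url?.match(/^\/live\/([^/]+)\/comments$/);
  if (!match) {
    res.writeHead(404).end();
    return;
  }
  const videoId = match[1];

  // SSE is plain HTTP: send stream headers and keep the socket open.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  if (!subscribers.has(videoId)) subscribers.set(videoId, new Set());
  subscribers.get(videoId)!.add(res);

  // Drop the viewer on disconnect so we don't write to dead sockets.
  req.on("close", () => subscribers.get(videoId)?.delete(res));
});

// Push one comment to every viewer of a video on this server.
export function broadcastComment(videoId: string, comment: unknown): void {
  const frame = `event: comment\ndata: ${JSON.stringify(comment)}\n\n`;
  for (const res of subscribers.get(videoId) ?? []) res.write(frame);
}

server.listen(8080);
```

On the client, new EventSource("/live/<id>/comments") handles reconnection automatically, and if each event also carries a monotonic id: field, the browser resends it as a Last-Event-ID header on reconnect so the server can backfill anything missed, which maps directly to the “no missing messages” requirement above.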
Back-of-the-envelope Estimations
Let’s work through the numbers to understand what we’re dealing with:
User Activity Assumptions:
10 million daily active users watching live videos
Peak concurrent users: ~2 million (20% of DAU)
Average of 1,000 concurrent live videos during peak
Popular videos might have 100,000+ concurrent viewers
Comment Volume Calculations:
Average user posts 5 comments per session
Active commenting users: ~20% of viewers
For a popular video with 100K viewers: 20K active commenters
If each commenter posts once every 2 minutes: ~167 comments/second per popular video
Across all streams: with ~2M concurrent viewers and 20% commenting once every 2 minutes, that’s ~3,300 comments/second system-wide (it would only reach ~167,000/second if every one of the 1,000 videos were as popular as the 100K-viewer case)
Storage Requirements:
Average comment size: ~100 bytes (including metadata)
Daily comments: 10M users × 5 comments = 50M comments
Daily storage: 50M × 100 bytes = 5GB per day
Annual storage: ~1.8TB (manageable, but we’ll want compression and archiving)
Bandwidth Estimates:
Each comment needs to be delivered to all viewers of that video
Popular video with 100K viewers receiving 167 comments/second
Outbound traffic per video: 167 comments/second × 100 bytes × 100K viewers ≈ 1.67 GB/second
This shows why we need efficient distribution mechanisms
Connection Requirements:
2 million concurrent SSE connections during peak
If each server handles 50K connections: need ~40 servers minimum
With redundancy and load distribution: ~100+ servers for connection handling
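To make these easy to sanity-check, here’s the whole estimation as runnable arithmetic; every input mirrors an assumption stated above, so you can tweak one and watch the conclusions move:

```typescript
// The back-of-the-envelope math above, as runnable arithmetic.
const DAU = 10_000_000;                          // daily active viewers
const peakConcurrentViewers = DAU * 0.2;         // 2M at peak
const commenterRatio = 0.2;                      // 20% of viewers comment
const secondsBetweenComments = 120;              // one comment per 2 minutes
const commentBytes = 100;                        // comment + metadata
const commentsPerUserPerDay = 5;
const popularVideoViewers = 100_000;
const connectionsPerServer = 50_000;

// Per popular video: 100K × 0.2 / 120 ≈ 167 comments/second
const popularVideoCps =
  (popularVideoViewers * commenterRatio) / secondsBetweenComments;

// System-wide: 2M × 0.2 / 120 ≈ 3,333 comments/second
const systemWideCps =
  (peakConcurrentViewers * commenterRatio) / secondsBetweenComments;

// Storage: 50M comments/day × 100 B = 5 GB/day, ~1.8 TB/year
const dailyStorageGB = (DAU * commentsPerUserPerDay * commentBytes) / 1e9;
const annualStorageTB = (dailyStorageGB * 365) / 1e3;

// Fan-out for one popular video: 167 × 100 B × 100K ≈ 1.67 GB/s egress
const popularVideoEgressGBps =
  (popularVideoCps * commentBytes * popularVideoViewers) / 1e9;

// Connection servers: 2M / 50K = 40 minimum, ~100+ with redundancy
const minConnectionServers = peakConcurrentViewers / connectionsPerServer;

console.log({
  popularVideoCps,        // ≈ 167
  systemWideCps,          // ≈ 3,333
  dailyStorageGB,         // = 5
  annualStorageTB,        // ≈ 1.8
  popularVideoEgressGBps, // ≈ 1.67
  minConnectionServers,   // = 40
});
```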
These numbers confirm we need a distributed architecture with efficient real-time communication.
🔒 System Design
This is the simple design I would draw: