You're in a system design interview. The interviewer asks: "How would you handle 1 million requests per second?" Your mind races. Where do you even start?
I've been on both sides of that question. And honestly, the first time I heard it, I said something like "add more servers and use Kafka" — which is roughly the worst possible answer. Not because it's wrong, but because it shows you're pattern-matching instead of thinking.
This question comes up constantly in senior and FAANG-level interviews, and for good reason. Twitter, Netflix, Uber — they're all operating at this scale or beyond. What interviewers want to see isn't that you know tool names. It's whether you can work through a problem systematically, do the math, and know why each decision makes sense.
This article is my attempt to break that down properly — layer by layer, with actual numbers. It draws heavily from Alex Xu's System Design Interview book, which I'd consider required reading, combined with what I've learned running RabbitMQ in production and digging into scaling problems the hard way.
First, what NOT to say
Let's get the anti-patterns out of the way because I've said most of these myself at some point:
- ❌ "Just use Kafka!" — Kafka is great. But why Kafka? For what part? At what throughput?
- ❌ "Scale vertically" — Sure, up to a point. A 192-core machine still has limits, and it costs a fortune
- ❌ "Add more servers" — How many? What's your trigger? What happens during a deploy?
- ❌ Jumping straight to architecture without asking a single clarifying question
The last one is the one that kills candidates. Before you draw a single box, you need to understand what you're actually building.
Start with the math, not the diagram
Before touching architecture, do the estimation. Interviewers genuinely respect this — it shows you're not just winging it.
1M requests/second
= 86.4 billion requests/day
= 2.6 trillion requests/month
If average response payload = 1KB:
= ~1 GB/sec of data transfer
= ~86 TB/day
AWS outbound transfer ~$0.08–$0.12/GB (varies by region) → this alone is a budget conversation

Those numbers should make you pause. They'll also probably make you want to ask some questions — which is exactly the right instinct.
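If you want to sanity-check these figures yourself, the arithmetic fits in a few lines (this assumes 1 KB = 1,000 bytes, as the estimate above does, and the low end of the transfer price):

```javascript
// Back-of-envelope for 1M req/sec with 1 KB average payloads
const rps = 1_000_000;
const requestsPerDay = rps * 86_400;              // 86.4 billion requests/day
const requestsPerMonth = requestsPerDay * 30;     // ~2.6 trillion/month
const bytesPerSec = rps * 1_000;                  // 1 GB/sec of transfer
const tbPerDay = (bytesPerSec * 86_400) / 1e12;   // ~86 TB/day
const egressCostPerDay = tbPerDay * 1_000 * 0.08; // ≈ $6,900/day at $0.08/GB

console.log(`~$${Math.round(egressCostPerDay)}/day in egress alone`);
```

That egress line is the budget conversation: transfer alone lands in the ~$200K/month range before a single server is provisioned.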
Before drawing anything, ask:
- Read-heavy or write-heavy? (90/10 or 50/50 changes everything)
- What's actually in those requests? API calls? Static assets? A mix?
- Single region or global?
- What's the acceptable p99 latency? 100ms? 500ms?
I'm not being rhetorical here — the architecture genuinely looks different depending on the answers. A read-heavy, globally distributed API with mostly static content is a completely different beast than a write-heavy transactional system.
The architecture, layer by layer
For this walkthrough, I'll assume a reasonably balanced API workload — no static CDN content, mostly read-heavy (80/20), globally distributed, p99 target of 200ms.
Here's the skeleton:
Internet
↓
Load Balancer (AWS ELB)
↓
API Servers (~150 instances, auto-scaling)
↓
Kafka Cluster (100 partitions, 3x replication)
↓
Consumer Workers (~1,000 instances)
↓
Redis Cache (targeting 80% hit rate)
↓
Database (sharded PostgreSQL + read replicas)

Now let's go through each one.
CDN — only if you actually need it
If your 1M req/sec includes static assets — images, JS bundles, CSS — a CDN with an 80-90% cache hit rate means the majority of requests never reach your infrastructure. That's huge.
But if you're serving pure API responses, a CDN won't help much. It caches; it doesn't process.
I'll admit I got confused on this early on when I was studying this problem. I kept treating DNS caching and CDNs as interchangeable, which is wrong. DNS resolves domain → IP; it does nothing to reduce request volume. For pure API traffic, skip the CDN and go straight to the load balancer. Keep it simple.
Load Balancer
The load balancer distributes requests across your API servers, does health checks, and handles SSL termination. Not glamorous, but it's the front door to everything.
| Option | Pros | Cons |
|---|---|---|
| AWS ELB / Azure LB | Fully managed, nothing to operate | Less control |
| NGINX / HAProxy | Highly configurable, good for edge cases | You own it |
My default is cloud-managed. At this scale, the last thing you want is to be debugging a load balancer config at 3am. Use round-robin or least-connections, keep sessions stateless if at all possible, and set health checks to catch dead nodes fast.
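For intuition, least-connections is just "pick the backend with the fewest in-flight requests." A toy sketch — the field names here are made up, not any real load balancer's API:

```javascript
// Toy least-connections picker: choose the backend currently handling
// the fewest in-flight requests. Real LBs track these counters live.
function pickBackend(backends) {
  return backends.reduce((best, b) =>
    b.inFlight < best.inFlight ? b : best
  );
}

const backends = [
  { host: "api-1", inFlight: 12 },
  { host: "api-2", inFlight: 3 },
  { host: "api-3", inFlight: 7 },
];
console.log(pickBackend(backends).host); // "api-2"
```

Least-connections tends to beat plain round-robin when request costs are uneven, because slow requests pile up connections on a node and the LB routes around it.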
API Servers — actually do the math
This is where the back-of-envelope pays off. "Add more servers" is useless without a number attached.
Real-world server throughput (these vary, but this is a reasonable ballpark):
| Runtime | Requests/sec per server |
|---|---|
| Go / Rust | 10,000 – 20,000 |
| Node.js | 5,000 – 10,000 |
| Python / Django | 1,000 – 2,000 |
Working through it for Node.js, taking the optimistic end of its range (10,000 req/sec per server):
1,000,000 req/sec ÷ 10,000 per server = 100 servers minimum
Add 50% buffer for:
- Traffic spikes
- Rolling deployments (some capacity always offline)
- Server failures
= ~150 servers as your baseline

Auto-scaling config in AWS would look roughly like this:
min_size: 100
max_size: 300
desired_capacity: 150
scaling_policy:
- metric: CPUUtilization
threshold: 70
cooldown: 300 # 5 minutes
adjustment: +20 # add 20 servers per trigger

One thing worth noting: the cooldown period matters more than most people think. If you scale too aggressively without cooldown, you end up in a flapping loop where you're constantly spinning up and tearing down instances, which causes its own instability.
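The cooldown logic itself is tiny; what matters is that it exists. A hypothetical sketch of the decision an autoscaler makes on each metrics tick:

```javascript
// Hypothetical autoscaler decision: scale out only when CPU is over the
// threshold AND the cooldown since the last scaling action has elapsed.
// Without the cooldown check, every tick above 70% adds more servers.
function shouldScaleOut(cpuPercent, lastScaleAtMs, nowMs, cooldownMs = 300_000) {
  const overThreshold = cpuPercent > 70;
  const cooledDown = nowMs - lastScaleAtMs >= cooldownMs;
  return overThreshold && cooledDown;
}

console.log(shouldScaleOut(85, 0, 60_000));  // false: hot, but still in cooldown
console.log(shouldScaleOut(85, 0, 400_000)); // true: hot AND cooled down
```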
Message Queue — Kafka vs RabbitMQ
Async processing is the key to keeping API response times fast. Instead of making the user wait while you write to three databases and fire off two external API calls, you publish an event and respond immediately. The heavy lifting happens in the background.
The Kafka vs RabbitMQ question comes up a lot, so here's how I think about it:
| | RabbitMQ | Kafka |
|---|---|---|
| Best for | Task queues (email, payments, notifications) | Event streaming (logs, activity, pipelines) |
| Throughput | Up to ~100K msg/sec | 1M+ msg/sec |
| Consumer model | Each message consumed once | Multiple consumers can read same events |
| Setup complexity | Fairly simple | More involved |
Honest take: In my day-to-day work, we use RabbitMQ for things like sending emails and running background jobs. It handles around 10K tasks a day without breaking a sweat, and the operational overhead is minimal. I like it for that use case. But for 1M req/sec? Kafka's partition-based parallelism is the only realistic option. More on that in a second.
Kafka cluster sizing:
Brokers: 10–20 servers
Partitions: 100–200 (this is your parallelism knob)
Replication factor: 3 (survives 2 broker failures simultaneously)
Partition throughput is best expressed in MB/s — it varies heavily with message size, batching, and compression. Think of partition count as your parallelism dial, not a precise throughput number. 100 partitions gives you substantial headroom for our workload.

The partition count is the number you want to think carefully about. You can increase brokers, but changing partition count later is painful. Start with more than you think you need.
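One detail worth being able to explain in an interview: how a message lands on a partition. Kafka hashes the message key and takes it modulo the partition count, which is why all events for one key stay ordered on one partition. A toy sketch (real clients use murmur2; the hash here is just for illustration):

```javascript
// Toy version of keyed partition routing: hash(key) % numPartitions.
// The real Kafka default partitioner uses murmur2, not this toy hash.
function toyHash(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function partitionFor(key, numPartitions = 100) {
  return toyHash(key) % numPartitions;
}

// The same key always routes to the same partition, preserving per-key order
console.log(partitionFor("user-42") === partitionFor("user-42")); // true
```

This is also why the partition count is painful to change: remapping keys to a new partition count breaks the per-key ordering guarantee for in-flight data.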
Consumer Workers — the worker math is humbling
How many worker instances do you need? Depends entirely on processing time per message:
If each message takes 100ms to process:
1 worker = 10 messages/sec
Need: 1,000,000 ÷ 10 = 100,000 workers — obviously not practical
If each message takes 10ms:
1 worker = 100 messages/sec
Need: 10,000 workers — expensive
If each message takes 1ms:
1 worker = 1,000 messages/sec
Need: 1,000 workers — manageable

This math is why optimizing the hot path inside your workers matters so much. Shaving 5ms off processing time doesn't just make things faster — it cuts your infrastructure costs dramatically.
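The relationship above is simple enough to express as a function. The `utilization` factor is my addition, to account for workers that aren't processing 100% of the time:

```javascript
// Workers needed = total rps ÷ (messages/sec one worker can handle).
// `utilization` discounts for time workers spend idle or polling.
function workersNeeded(rps, processingMs, utilization = 1.0) {
  const perWorker = (1000 / processingMs) * utilization;
  return Math.ceil(rps / perWorker);
}

console.log(workersNeeded(1_000_000, 100)); // 100000
console.log(workersNeeded(1_000_000, 10));  // 10000
console.log(workersNeeded(1_000_000, 1));   // 1000
```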
Practical optimizations:
- Batch processing: Consume 100 messages at once, not one at a time. Most databases handle bulk inserts far better than individual writes
- Profile your slow paths: Nine times out of ten, there's one database query or external API call doing most of the damage
- Language choice matters: A Go worker can genuinely be 5–10x more efficient than a Python one at CPU-bound processing
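Here's a sketch of what batch consumption looks like inside a worker. `fetchBatch`, `bulkInsert`, and `commit` are stand-ins for whatever your Kafka client and database driver actually expose — the point is the shape: one round trip per batch instead of per message:

```javascript
// Hypothetical worker step: fetch up to `batchSize` messages, bulk-insert
// them in one round trip, then commit offsets only after the durable write.
async function processBatch(consumer, db, batchSize = 100) {
  const batch = await consumer.fetchBatch(batchSize);
  if (batch.length === 0) return 0;

  await db.bulkInsert("events", batch.map((m) => m.value)); // 1 write, not 100
  await consumer.commit(); // ack after the write, so failures get redelivered
  return batch.length;
}
```

Committing only after the write gives you at-least-once delivery, which is exactly why the idempotency section later in this article matters.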
Redis Caching — this is the one that saves your database
Without caching, you're routing 1M req/sec to your database. Your database will not survive. This isn't a pessimistic take — it's just arithmetic.
Without cache:
1M req/sec → hits the database → the database catches fire
With 80% cache hit rate:
800K req/sec → Redis (sub-millisecond, in-memory)
200K req/sec → Database (a load it can actually handle)

Redis cluster sizing:
Single node: ~50K–200K ops/sec (depends heavily on payload size, command mix, and pipelining)
5-node cluster: covers our workload at the upper end of that range
With our 80% hit rate:
800K reads/sec served by Redis
~50K writes/sec plus cache fills from misses also hitting Redis
5 nodes works on paper, but add ~30% headroom for spikes and node failures

What's worth caching:
| Data | TTL | Why |
|---|---|---|
| User profiles | 15 minutes | High read, rare writes |
| Product catalog | 1 hour | Changes infrequently |
| API responses | 1–5 minutes | Depends on freshness requirements |
| Sessions | 30 minutes | Standard pattern |
What you should not cache: financial transactions, anything where stale data causes real damage (stock prices, live auction state). Cache is an optimization, not a source of truth.
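The read path for the cacheable rows above is classic cache-aside. A sketch assuming a node-redis-style client — the key naming and `findUser` helper are made up for illustration:

```javascript
// Cache-aside read path: try Redis first, fall back to the database on a
// miss, then populate the cache with a TTL so the next read is a hit.
async function getUserProfile(redis, db, userId) {
  const key = `user:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // the ~80% that hit

  const profile = await db.findUser(userId); // the ~20% that miss
  await redis.set(key, JSON.stringify(profile), { EX: 900 }); // 15-min TTL
  return profile;
}
```

The TTL is doing double duty here: it bounds staleness (the "Why" column in the table) and it naturally evicts cold keys so memory stays proportional to the working set.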
Database — the write-heavy case is scarier than it looks
The strategy splits entirely based on your read/write ratio.
Read-heavy (90/10):
Writes: 100K/sec → 1 primary handles this
Reads: 900K/sec → after 80% cache hit → 180K reads/sec remain
PostgreSQL replica throughput: ~12K reads/sec each
Need: 180K ÷ 12K = 15 read replicas
Setup: 1 primary + 15 replicas + a read load balancer in front

Write-heavy (50/50) — this is where things get uncomfortable:
Writes: 500K/sec
PostgreSQL single primary ceiling: roughly 10K–30K writes/sec
(heavily dependent on schema, indexes, hardware, and transaction size)
You need sharding.
Using 10K as a conservative anchor: 500K ÷ 10K = 50 shards minimum
Each shard: 1 primary + 5 read replicas
Total: 50 × 6 = 300 database instances
Yes, 300. Welcome to distributed systems.
At this write volume it's also worth asking whether PostgreSQL is the right tool at all — Cassandra, ScyllaDB, or a write-behind/CQRS pattern might be a better fit depending on the access patterns.

Sharding strategies each have tradeoffs:
- Hash-based: `shard_id = hash(user_id) % 50` — even distribution, easy to implement, hard to do range queries
- Range-based: Users 1–1M on shard 1, etc. — great for range queries, terrible if your ID distribution isn't uniform
- Geographic: US → US shards, EU → EU shards — good for latency and data residency, complex to operate
The things people underestimate about sharding: cross-shard JOINs are painful (you basically can't do them), rebalancing when you grow is a whole project in itself, and the celebrity problem — where one shard gets disproportionate load because a viral user lives there — is a real production headache that pure hash-based sharding doesn't fully solve.
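The rebalancing pain is easy to demonstrate. With plain `hash % N`, growing from 50 to 60 shards remaps the large majority of keys, which means mass data movement. A quick sketch (toy hash, not a production one):

```javascript
// With modulo sharding, changing the shard count remaps most keys.
// This is exactly the problem consistent hashing was invented to shrink.
function shardFor(id, numShards) {
  let h = 0;
  for (const ch of String(id)) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % numShards;
}

let moved = 0;
for (let id = 0; id < 10_000; id++) {
  if (shardFor(id, 50) !== shardFor(id, 60)) moved++;
}
console.log(`${moved} of 10000 keys change shards`); // the vast majority
```

With consistent hashing, adding shards only moves roughly `keys / numShards` of the data, which is why it's the standard answer to the rebalancing question.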
The stuff the architecture diagrams leave out
This is the part that actually separates senior engineers in interviews. The boxes in the diagram are the easy part. Everything below is what keeps systems running at 3am.
Dead Letter Queues and retry logic
When a worker fails to process a message, you need a plan:
Message → Worker → fails
↓
Exponential backoff retry:
Attempt 1: wait 1s
Attempt 2: wait 2s
Attempt 3: wait 4s
Attempt 4: wait 8s
Cap at ~5 minutes
↓
Still failing? → Dead Letter Queue

The DLQ is where messages go once their retries are exhausted. You need to monitor it actively — DLQ depth going up means something is systematically broken, not just transiently bad. Alert on it, review it daily, understand why things end up there.
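The backoff schedule above is one line of code. Jitter is worth mentioning alongside it in an interview: adding randomness prevents thousands of failed consumers from retrying at the same instant. A sketch:

```javascript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to 5 minutes.
function backoffMs(attempt, baseMs = 1_000, capMs = 300_000) {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

// "Full jitter" variant: sleep a random duration up to the computed backoff,
// so a fleet of retrying workers doesn't hammer the service in lockstep.
function backoffWithJitterMs(attempt) {
  return Math.random() * backoffMs(attempt);
}

console.log(backoffMs(1));  // 1000
console.log(backoffMs(4));  // 8000
console.log(backoffMs(12)); // 300000 (capped)
```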
Idempotency — the thing that bites you eventually
At scale, you will process the same request twice. Network timeouts, client retries, deploy hiccups — it happens. If your system isn't idempotent, the user gets charged twice or sees a duplicate order.
// Client sends a unique requestId every time
POST /api/orders
{
"requestId": "uuid-12345-abc",
"item": "product-789",
"quantity": 1
}
// Before processing, check if we've seen this before
const key = `processed:${requestId}`;
if (await redis.exists(key)) {
return await redis.get(key); // Already done, return cached result
}
const result = await processOrder(payload);
await redis.set(key, result, { EX: 86400 }); // 24hr TTL
return result;This isn't theoretical. In my current project we submit license orders to a Microsoft API. We always pass a
requestId. If the request times out and we retry, Microsoft checks the ID and returns the same response instead of creating a duplicate order. Same pattern on both sides of the call. It's saved us from some embarrassing incidents.
Rate limiting — protect yourself before users abuse you
Reasonable limits:
Per user: 1,000 req/min
Per IP: 10,000 req/min
Per API key: 100,000 req/min

A simple fixed-window counter in Redis (strictly speaking this is a fixed window, not a sliding one, but at the gateway it's usually good enough):
async function checkRateLimit(userId) {
  // Key is bucketed by minute, so the counter resets each window
  const key = `rate:${userId}:${Math.floor(Date.now() / 60000)}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, 60); // set the TTL once, on first increment
  }
  if (count > 1000) {
    return { allowed: false, retryAfter: 60 };
  }
  return { allowed: true };
}

Monitoring — you need this before everything else, not after
There's a temptation to treat observability as something you add once the system is built. That's backwards. At this scale, you're essentially flying blind without it.
What to track:
| Metric | Alert threshold |
|---|---|
| Error rate | > 5% → Critical |
| p99 latency | > 1s → Critical |
| p95 latency | > 500ms → Warning |
| Kafka consumer lag | > 100K messages → Critical |
| DLQ depth | > 100 messages → Warning |
| Cache hit rate | < 70% → Warning |
| DB replication lag | > 5s → Critical |
| CPU / Memory | > 80%/85% → Warning |
Tools I'd actually use:
- Prometheus + Grafana for metrics and dashboards
- ELK stack for log aggregation and search
- Jaeger or Zipkin for distributed tracing (invaluable for finding where latency is actually hiding)
- DataDog or New Relic if the budget is there — the out-of-the-box integrations save real time
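If "p99" in the table above feels abstract: it's the latency that 99% of requests beat. A naive computation from raw samples makes it concrete — real monitoring systems use pre-bucketed histograms (HDR, Prometheus buckets) rather than sorting everything:

```javascript
// Naive percentile over raw latency samples, for intuition only.
// Production systems aggregate into histograms instead of sorting.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latencies = Array.from({ length: 100 }, (_, i) => i + 1); // 1..100 ms
console.log(percentile(latencies, 99)); // 99
console.log(percentile(latencies, 50)); // 50
```

The reason the table alerts on p99 rather than the average: at 1M req/sec, a p99 of 1s means 10,000 users per second are having a bad time, even if the mean looks fine.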
What this actually costs
I think this section is underrated in system design discussions. Architecture at this scale costs real money, and you should be able to talk about it.
Monthly AWS estimate (on-demand pricing):
| Component | Setup | Monthly |
|---|---|---|
| API Servers | 150 × t3.large @ $0.08/hr | $8,640 |
| Kafka | 20 × r5.xlarge @ $0.25/hr | $3,600 |
| Workers | 1,000 × t3.medium @ $0.04/hr | $28,800 |
| Redis | 5 × r5.large @ $0.13/hr | $468 |
| Database | 50 × db.r5.2xlarge @ $0.60/hr | $21,600 |
| Load Balancers | — | ~$500 |
| Total | — | ~$63,600/month |
That's ~$763K/year at on-demand pricing. These figures are illustrative — instance pricing varies by region and changes over time, so verify against current AWS pricing before any real planning. With reserved instances on stable workloads and spot instances for workers (with proper shutdown handling), you can realistically get to $300–400K/year. Still a significant number, but that's what operating at scale looks like.
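A quick sanity check on the table's arithmetic, with the numbers copied from the rows above and instance-hours assuming a 720-hour month:

```javascript
// Recomputing the monthly total from the table's own inputs
const hours = 720; // 30-day month
const monthly = Math.round(
  150 * 0.08 * hours +  // API servers
  20 * 0.25 * hours +   // Kafka brokers
  1000 * 0.04 * hours + // consumer workers
  5 * 0.13 * hours +    // Redis nodes
  50 * 0.6 * hours +    // database instances
  500                   // load balancers (flat estimate)
);

console.log(monthly);      // 63608
console.log(monthly * 12); // 763296, i.e. ~$763K/year
```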
How to actually present this in an interview
The first 60–90 seconds should be a high-level summary, then let the interviewer steer. Don't try to say everything — that's another common mistake.
Something like:
"For 1M req/sec, I'd go with a multi-layered setup. Load balancer in front distributing to ~150 API servers with auto-scaling at 70% CPU. Servers publish events to Kafka — 100 partitions, replication factor 3. Around 1,000 workers consume from Kafka. Redis in front of the database targeting 80% hit rate, which gets effective database load down to ~200K req/sec.
On the database side, if read-heavy, one primary and ~15 read replicas. If write-heavy, we'd need sharding — probably 50 shards. Total cost on AWS is roughly $64K/month on-demand, maybe $25–35K with reserved and spot pricing.
Operationally: rate limiting at the gateway, idempotency keys on critical operations, DLQs with exponential backoff, Prometheus and Grafana for observability.
What would you like to dig into?"
Then stop talking. The follow-up questions are where you actually demonstrate depth.
Common ones you should be ready for:
- "What if Kafka goes down?" → Replication factor 3 means you survive 2 broker failures. Leader election handles failover automatically. Clients should retry with backoff.
- "How do you handle celebrity users?" → Pre-warm cache for known high-traffic accounts. Separate priority queues. Pre-computed aggregates for read-heavy celebrity data.
- "Database hotspots?" → Monitor per-shard load in Grafana. Consistent hashing helps distribute evenly. Celebrity shards get extra replicas.
- "Zero-downtime deploys?" → Blue-green for API servers (stateless, easy). Canary releases for Kafka consumer changes (consumers are stateful, more careful rollout needed).
What I actually learned writing this
I went into this thinking I understood scaling reasonably well. The exercise of doing the actual math on everything — workers, shards, Redis nodes, replicas — was more humbling than I expected.
A few things that shifted:
The worker count math is what gets people. It's easy to say "spin up more workers" until you realize that 100ms average processing time means you need 100,000 of them. Suddenly optimizing your hot path from 100ms to 1ms isn't a nice-to-have — it's a 100x reduction in infrastructure cost.
Caching isn't an optimization at this scale. It's structural. Without it, the database numbers simply don't work, regardless of how well you've sharded.
And the cost section genuinely surprised me. I knew it would be expensive. I didn't have a specific number in my head. $64K/month is a real thing to say in an interview — it shows you're thinking like someone who's going to own this in production, not just design it on a whiteboard and hand it off.
Further reading
- Alex Xu — System Design Interview: An Insider's Guide — the most practical starting point I've found
- Martin Kleppmann — Designing Data-Intensive Applications — if you want to go deep on the database and streaming theory
- Kafka documentation on partitions and consumer groups — worth reading past the quickstart
- Redis Cluster spec — surprisingly readable, good for understanding sharding internals
What would you add to this? If you've actually operated systems at this scale, I'm curious what I'm missing or oversimplifying — drop a comment or reach out on LinkedIn.