You're in a system design interview. The interviewer asks: "How would you handle 1 million requests per second?" Your mind races. Where do you even start?
I've been on both sides of that question. And honestly, the first time I heard it, I said something like "add more servers and use Kafka" — which is roughly the worst possible answer. Not because it's wrong, but because it shows you're pattern-matching instead of thinking.
This question comes up constantly in senior and FAANG-level interviews, and for good reason. Twitter, Netflix, Uber — they're all operating at this scale or beyond. What interviewers want to see isn't that you know tool names. It's whether you can work through a problem systematically, do the math, and know why each decision makes sense.
This article is my attempt to break that down properly — layer by layer, with actual numbers. It draws heavily from Alex Xu's System Design Interview book, which I'd consider required reading, combined with what I've learned running RabbitMQ in production and digging into scaling problems the hard way.
First, what NOT to say
Let's get the anti-patterns out of the way because I've said most of these myself at some point:
- ❌ "Just use Kafka!" — Kafka is great. But why Kafka? For what part? At what throughput?
- ❌ "Scale vertically" — Sure, up to a point. A 192-core machine still has limits, and it costs a fortune
- ❌ "Add more servers" — How many? What's your trigger? What happens during a deploy?
- ❌ Jumping straight to architecture without asking a single clarifying question
The last one is the one that kills candidates. Before you draw a single box, you need to understand what you're actually building.
Start with the math, not the diagram
Before touching architecture, do the estimation. Interviewers genuinely respect this — it shows you're not just winging it.
1M requests/second
= 86.4 billion requests/day
= 2.6 trillion requests/month
If average response payload = 1KB:
= ~1 GB/sec of data transfer
= ~86 TB/day
AWS outbound transfer ~$0.08–$0.12/GB (varies by region) → this alone is a budget conversation

Those numbers should make you pause. They'll also probably make you want to ask some questions — which is exactly the right instinct.
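If you want to sanity-check these figures yourself, the arithmetic fits in a few lines (this assumes 1 KB = 1,000 bytes, as the estimate above does, and the low end of the transfer price):

```javascript
// Back-of-envelope for 1M req/sec with 1 KB average payloads
const rps = 1_000_000;
const requestsPerDay = rps * 86_400;              // 86.4 billion requests/day
const requestsPerMonth = requestsPerDay * 30;     // ~2.6 trillion/month
const bytesPerSec = rps * 1_000;                  // 1 GB/sec of transfer
const tbPerDay = (bytesPerSec * 86_400) / 1e12;   // ~86 TB/day
const egressCostPerDay = tbPerDay * 1_000 * 0.08; // ≈ $6,900/day at $0.08/GB

console.log(`~$${Math.round(egressCostPerDay)}/day in egress alone`);
```

That egress line is the budget conversation: transfer alone lands in the ~$200K/month range before a single server is provisioned.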
Before drawing anything, ask:
- Read-heavy or write-heavy? (90/10 or 50/50 changes everything)
- What's actually in those requests? API calls? Static assets? A mix?
- Single region or global?
- What's the acceptable p99 latency? 100ms? 500ms?
I'm not being rhetorical here — the architecture genuinely looks different depending on the answers. A read-heavy, globally distributed API with mostly static content is a completely different beast than a write-heavy transactional system.
The architecture, layer by layer
For this walkthrough, I'll assume a reasonably balanced API workload — no static CDN content, mostly read-heavy (80/20), globally distributed, p99 target of 200ms.
Here's the skeleton:
Internet
↓
Load Balancer (AWS ELB)
↓
API Servers (~150 instances, auto-scaling)
↓
Kafka Cluster (100 partitions, 3x replication)
↓
Consumer Workers (~1,000 instances)
↓
Redis Cache (targeting 80% hit rate)
↓
Database (sharded PostgreSQL + read replicas)

Now let's go through each one.
CDN — only if you actually need it
If your 1M req/sec includes static assets — images, JS bundles, CSS — a CDN with an 80-90% cache hit rate means the majority of requests never reach your infrastructure. That's huge.
But if you're serving pure API responses, a CDN won't help much. It caches; it doesn't process.
I'll admit I got confused on this early on when I was studying this problem. I kept treating DNS caching and CDNs as interchangeable, which is wrong. DNS resolves domain → IP; it does nothing to reduce request volume. For pure API traffic, skip the CDN and go straight to the load balancer. Keep it simple.
Load Balancer
The load balancer distributes requests across your API servers, does health checks, and handles SSL termination. Not glamorous, but it's the front door to everything.
| Option | Pros | Cons |
|---|---|---|
| AWS ELB / Azure LB | Fully managed, nothing to operate | Less control |
| NGINX / HAProxy | Highly configurable, good for edge cases | You own it |
My default is cloud-managed. At this scale, the last thing you want is to be debugging a load balancer config at 3am. Use round-robin or least-connections, keep sessions stateless if at all possible, and set health checks to catch dead nodes fast.
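For intuition, least-connections is just "pick the backend with the fewest in-flight requests." A toy sketch — the field names here are made up, not any real load balancer's API:

```javascript
// Toy least-connections picker: choose the backend currently handling
// the fewest in-flight requests. Real LBs track these counters live.
function pickBackend(backends) {
  return backends.reduce((best, b) =>
    b.inFlight < best.inFlight ? b : best
  );
}

const backends = [
  { host: "api-1", inFlight: 12 },
  { host: "api-2", inFlight: 3 },
  { host: "api-3", inFlight: 7 },
];
console.log(pickBackend(backends).host); // "api-2"
```

Least-connections tends to beat plain round-robin when request costs are uneven, because slow requests pile up connections on a node and the LB routes around it.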
API Servers — actually do the math
This is where the back-of-envelope pays off. "Add more servers" is useless without a number attached.
Real-world server throughput (these vary, but this is a reasonable ballpark):
| Runtime | Requests/sec per server |
|---|---|
| Go / Rust | 10,000 – 20,000 |
| Node.js | 5,000 – 10,000 |
| Python / Django | 1,000 – 2,000 |
Working through it for Node.js, taking the optimistic end of its range (10,000 req/sec per server):
1,000,000 req/sec ÷ 10,000 per server = 100 servers minimum
Add 50% buffer for:
- Traffic spikes
- Rolling deployments (some capacity always offline)
- Server failures
= ~150 servers as your baseline

Auto-scaling config in AWS would look roughly like this:
min_size: 100
max_size: 300
desired_capacity: 150
scaling_policy:
- metric: CPUUtilization
threshold: 70
cooldown: 300 # 5 minutes
adjustment: +20 # add 20 servers per trigger

One thing worth noting: the cooldown period matters more than most people think. If you scale too aggressively without cooldown, you end up in a flapping loop where you're constantly spinning up and tearing down instances, which causes its own instability.
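The cooldown logic itself is tiny; what matters is that it exists. A hypothetical sketch of the decision an autoscaler makes on each metrics tick:

```javascript
// Hypothetical autoscaler decision: scale out only when CPU is over the
// threshold AND the cooldown since the last scaling action has elapsed.
// Without the cooldown check, every tick above 70% adds more servers.
function shouldScaleOut(cpuPercent, lastScaleAtMs, nowMs, cooldownMs = 300_000) {
  const overThreshold = cpuPercent > 70;
  const cooledDown = nowMs - lastScaleAtMs >= cooldownMs;
  return overThreshold && cooledDown;
}

console.log(shouldScaleOut(85, 0, 60_000));  // false: hot, but still in cooldown
console.log(shouldScaleOut(85, 0, 400_000)); // true: hot AND cooled down
```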
Message Queue — Kafka vs RabbitMQ
Async processing is the key to keeping API response times fast. Instead of making the user wait while you write to three databases and fire off two external API calls, you publish an event and respond immediately. The heavy lifting happens in the background.
The Kafka vs RabbitMQ question comes up a lot, so here's how I think about it:
| | RabbitMQ | Kafka |
|---|---|---|
| Best for | Task queues (email, payments, notifications) | Event streaming (logs, activity, pipelines) |
| Throughput | Up to ~100K msg/sec | 1M+ msg/sec |
| Consumer model | Each message consumed once | Multiple consumers can read same events |
| Setup complexity | Fairly simple | More involved |
Honest take: In my day-to-day work, we use RabbitMQ for things like sending emails and running background jobs. It handles around 10K tasks a day without breaking a sweat, and the operational overhead is minimal. I like it for that use case. But for 1M req/sec? Kafka's partition-based parallelism is the only realistic option. More on that in a second.
Kafka cluster sizing:
Brokers: 10–20 servers
Partitions: 100–200 (this is your parallelism knob)
Replication factor: 3 (survives 2 broker failures simultaneously)
Partition throughput is best expressed in MB/s — it varies heavily with message size, batching, and compression. Think of partition count as your parallelism dial, not a precise throughput number. 100 partitions gives you substantial headroom for our workload.

The partition count is the number you want to think carefully about. You can increase brokers, but changing partition count later is painful. Start with more than you think you need.
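One detail worth being able to explain in an interview: how a message lands on a partition. Kafka hashes the message key and takes it modulo the partition count, which is why all events for one key stay ordered on one partition. A toy sketch (real clients use murmur2; the hash here is just for illustration):

```javascript
// Toy version of keyed partition routing: hash(key) % numPartitions.
// The real Kafka default partitioner uses murmur2, not this toy hash.
function toyHash(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function partitionFor(key, numPartitions = 100) {
  return toyHash(key) % numPartitions;
}

// The same key always routes to the same partition, preserving per-key order
console.log(partitionFor("user-42") === partitionFor("user-42")); // true
```

This is also why the partition count is painful to change: remapping keys to a new partition count breaks the per-key ordering guarantee for in-flight data.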
Consumer Workers — the worker math is humbling
How many worker instances do you need? Depends entirely on processing time per message:
If each message takes 100ms to process:
1 worker = 10 messages/sec
Need: 1,000,000 ÷ 10 = 100,000 workers — obviously not practical
If each message takes 10ms:
1 worker = 100 messages/sec
Need: 10,000 workers — expensive
If each message takes 1ms:
1 worker = 1,000 messages/sec
Need: 1,000 workers — manageable

This math is why optimizing the hot path inside your workers matters so much. Shaving 5ms off processing time doesn't just make things faster — it cuts your infrastructure costs dramatically.
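The relationship above is simple enough to express as a function. The `utilization` factor is my addition, to account for workers that aren't processing 100% of the time:

```javascript
// Workers needed = total rps ÷ (messages/sec one worker can handle).
// `utilization` discounts for time workers spend idle or polling.
function workersNeeded(rps, processingMs, utilization = 1.0) {
  const perWorker = (1000 / processingMs) * utilization;
  return Math.ceil(rps / perWorker);
}

console.log(workersNeeded(1_000_000, 100)); // 100000
console.log(workersNeeded(1_000_000, 10));  // 10000
console.log(workersNeeded(1_000_000, 1));   // 1000
```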
Practical optimizations:
- Batch processing: Consume 100 messages at once, not one at a time. Most databases handle bulk inserts far better than individual writes
- Profile your slow paths: Nine times out of ten, there's one database query or external API call doing most of the damage
- Language choice matters: A Go worker can genuinely be 5–10x more efficient than a Python one at CPU-bound processing
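Here's a sketch of what batch consumption looks like inside a worker. `fetchBatch`, `bulkInsert`, and `commit` are stand-ins for whatever your Kafka client and database driver actually expose — the point is the shape: one round trip per batch instead of per message:

```javascript
// Hypothetical worker step: fetch up to `batchSize` messages, bulk-insert
// them in one round trip, then commit offsets only after the durable write.
async function processBatch(consumer, db, batchSize = 100) {
  const batch = await consumer.fetchBatch(batchSize);
  if (batch.length === 0) return 0;

  await db.bulkInsert("events", batch.map((m) => m.value)); // 1 write, not 100
  await consumer.commit(); // ack after the write, so failures get redelivered
  return batch.length;
}
```

Committing only after the write gives you at-least-once delivery, which is exactly why the idempotency section later in this article matters.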
Redis Caching — this is the one that saves your database
Without caching, you're routing 1M req/sec to your database. Your database will not survive. This isn't a pessimistic take — it's just arithmetic.
Without cache:
1M req/sec → hits the database → the database catches fire
With 80% cache hit rate:
800K req/sec → Redis (sub-millisecond, in-memory)
200K req/sec → Database (a load it can actually handle)

Redis cluster sizing:
Single node: ~50K–200K ops/sec (depends heavily on payload size, command mix, and pipelining)
5-node cluster: covers our workload at the upper end of that range
With our 80% hit rate:
800K reads/sec served by Redis
~50K writes/sec plus cache fills from misses also hitting Redis
5 nodes works on paper, but add ~30% headroom for spikes and node failures

What's worth caching:
| Data | TTL | Why |
|---|---|---|
| User profiles | 15 minutes | High read, rare writes |
| Product catalog | 1 hour | Changes infrequently |
| API responses | 1–5 minutes | Depends on freshness requirements |
| Sessions | 30 minutes | Standard pattern |
What you should not cache: financial transactions, anything where stale data causes real damage (stock prices, live auction state). Cache is an optimization, not a source of truth.
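The read path for the cacheable rows above is classic cache-aside. A sketch assuming a node-redis-style client — the key naming and `findUser` helper are made up for illustration:

```javascript
// Cache-aside read path: try Redis first, fall back to the database on a
// miss, then populate the cache with a TTL so the next read is a hit.
async function getUserProfile(redis, db, userId) {
  const key = `user:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // the ~80% that hit

  const profile = await db.findUser(userId); // the ~20% that miss
  await redis.set(key, JSON.stringify(profile), { EX: 900 }); // 15-min TTL
  return profile;
}
```

The TTL is doing double duty here: it bounds staleness (the "Why" column in the table) and it naturally evicts cold keys so memory stays proportional to the working set.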
Database — the write-heavy case is scarier than it looks
The strategy splits entirely based on your read/write ratio.
Read-heavy (90/10):
Writes: 100K/sec → 1 primary handles this
Reads: 900K/sec → after 80% cache hit → 180K reads/sec remain
PostgreSQL replica throughput: ~12K reads/sec each
Need: 180K ÷ 12K = 15 read replicas
Setup: 1 primary + 15 replicas + a read load balancer in front

Write-heavy (50/50) — this is where things get uncomfortable:
Writes: 500K/sec
PostgreSQL single primary ceiling: roughly 10K–30K writes/sec
(heavily dependent on schema, indexes, hardware, and transaction size)
You need sharding.
Using 10K as a conservative anchor: 500K ÷ 10K = 50 shards minimum
Each shard: 1 primary + 5 read replicas
Total: 50 × 6 = 300 database instances
Yes, 300. Welcome to distributed systems.
At this write volume it's also worth asking whether PostgreSQL is the right tool at all — Cassandra, ScyllaDB, or a write-behind/CQRS pattern might be a better fit depending on the access patterns.

Sharding strategies each have tradeoffs:
- Hash-based: `shard_id = hash(user_id) % 50` — even distribution, easy to implement, hard to do range queries
- Range-based: Users 1–1M on shard 1, etc. — great for range queries, terrible if your ID distribution isn't uniform
- Geographic: US → US shards, EU → EU shards — good for latency and data residency, complex to operate
The things people underestimate about sharding: cross-shard JOINs are painful (you basically can't do them), rebalancing when you grow is a whole project in itself, and the celebrity problem — where one shard gets disproportionate load because a viral user lives there — is a real production headache that pure hash-based sharding doesn't fully solve.
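The rebalancing pain is easy to demonstrate. With plain `hash % N`, growing from 50 to 60 shards remaps the large majority of keys, which means mass data movement. A quick sketch (toy hash, not a production one):

```javascript
// With modulo sharding, changing the shard count remaps most keys.
// This is exactly the problem consistent hashing was invented to shrink.
function shardFor(id, numShards) {
  let h = 0;
  for (const ch of String(id)) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % numShards;
}

let moved = 0;
for (let id = 0; id < 10_000; id++) {
  if (shardFor(id, 50) !== shardFor(id, 60)) moved++;
}
console.log(`${moved} of 10000 keys change shards`); // the vast majority
```

With consistent hashing, adding shards only moves roughly `keys / numShards` of the data, which is why it's the standard answer to the rebalancing question.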
The stuff the architecture diagrams leave out
This is the part that actually separates senior engineers in interviews. The boxes in the diagram are the easy part. Everything below is what keeps systems running at 3am.
Dead Letter Queues and retry logic
When a worker fails to process a message, you need a plan:
Message → Worker → fails
↓
Exponential backoff retry:
Attempt 1: wait 1s
Attempt 2: wait 2s
Attempt 3: wait 4s
Attempt 4: wait 8s
Cap at ~5 minutes
↓
Still failing? → Dead Letter Queue

The DLQ is where messages go once their retries are exhausted. You need to monitor it actively — DLQ depth going up means something is systematically broken, not just transiently bad. Alert on it, review it daily, understand why things end up there.
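The backoff schedule above is one line of code. Jitter is worth mentioning alongside it in an interview: adding randomness prevents thousands of failed consumers from retrying at the same instant. A sketch:

```javascript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to 5 minutes.
function backoffMs(attempt, baseMs = 1_000, capMs = 300_000) {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

// "Full jitter" variant: sleep a random duration up to the computed backoff,
// so a fleet of retrying workers doesn't hammer the service in lockstep.
function backoffWithJitterMs(attempt) {
  return Math.random() * backoffMs(attempt);
}

console.log(backoffMs(1));  // 1000
console.log(backoffMs(4));  // 8000
console.log(backoffMs(12)); // 300000 (capped)
```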
Idempotency — the thing that bites you eventually
At scale, you will process the same request twice. Network timeouts, client retries, deploy hiccups — it happens. If your system isn't idempotent, the user gets charged twice or sees a duplicate order.
// Client sends a unique requestId every time
POST /api/orders
{
"requestId": "uuid-12345-abc",
"item": "product-789",
"quantity": 1
}
// Before processing, check if we've seen this before
const key = `processed:${requestId}`;
if (await redis.exists(key)) {
return await redis.get(key); // Already done, return cached result
}
const result = await processOrder(payload);
await redis.set(key, result, { EX: 86400 }); // 24hr TTL
return result;This isn't theoretical. In my current project we submit license orders to a Microsoft API. We always pass a
requestId. If the request times out and we retry, Microsoft checks the ID and returns the same response instead of creating a duplicate order. Same pattern on both sides of the call. It's saved us from some embarrassing incidents.
Rate limiting — protect yourself before users abuse you
Reasonable limits:
Per user: 1,000 req/min
Per IP: 10,000 req/min
Per API key: 100,000 req/min

A simple fixed-window counter in Redis (strictly speaking this is a fixed window, not a sliding one, but at the gateway it's usually good enough):
async function checkRateLimit(userId) {
  // Key is bucketed by minute, so the counter resets each window
  const key = `rate:${userId}:${Math.floor(Date.now() / 60000)}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, 60); // set the TTL once, on first increment
  }
  if (count > 1000) {
    return { allowed: false, retryAfter: 60 };
  }
  return { allowed: true };
}

Monitoring — you need this before everything else, not after
There's a temptation to treat observability as something you add once the system is built. That's backwards. At this scale, you're essentially flying blind without it.
What to track:
| Metric | Alert threshold |
|---|---|
| Error rate | > 5% → Critical |
| p99 latency | > 1s → Critical |
| p95 latency | > 500ms → Warning |
| Kafka consumer lag | > 100K messages → Critical |
| DLQ depth | > 100 messages → Warning |
| Cache hit rate | < 70% → Warning |
| DB replication lag | > 5s → Critical |
| CPU / Memory | > 80%/85% → Warning |
Tools I'd actually use:
- Prometheus + Grafana for metrics and dashboards
- ELK stack for log aggregation and search
- Jaeger or Zipkin for distributed tracing (invaluable for finding where latency is actually hiding)
- DataDog or New Relic if the budget is there — the out-of-the-box integrations save real time
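If "p99" in the table above feels abstract: it's the latency that 99% of requests beat. A naive computation from raw samples makes it concrete — real monitoring systems use pre-bucketed histograms (HDR, Prometheus buckets) rather than sorting everything:

```javascript
// Naive percentile over raw latency samples, for intuition only.
// Production systems aggregate into histograms instead of sorting.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latencies = Array.from({ length: 100 }, (_, i) => i + 1); // 1..100 ms
console.log(percentile(latencies, 99)); // 99
console.log(percentile(latencies, 50)); // 50
```

The reason the table alerts on p99 rather than the average: at 1M req/sec, a p99 of 1s means 10,000 users per second are having a bad time, even if the mean looks fine.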
What this actually costs
I think this section is underrated in system design discussions. Architecture at this scale costs real money, and you should be able to talk about it.
Monthly AWS estimate (on-demand pricing):
| Component | Setup | Monthly |
|---|---|---|
| API Servers | 150 × t3.large @ $0.08/hr | $8,640 |
| Kafka | 20 × r5.xlarge @ $0.25/hr | $3,600 |
| Workers | 1,000 × t3.medium @ $0.04/hr | $28,800 |
| Redis | 5 × r5.large @ $0.13/hr | $468 |
| Database | 50 × db.r5.2xlarge @ $0.60/hr | $21,600 |
| Load Balancers | — | ~$500 |
| Total | — | ~$63,600/month |
That's ~$763K/year at on-demand pricing. These figures are illustrative — instance pricing varies by region and changes over time, so verify against current AWS pricing before any real planning. With reserved instances on stable workloads and spot instances for workers (with proper shutdown handling), you can realistically get to $300–400K/year. Still a significant number, but that's what operating at scale looks like.
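A quick sanity check on the table's arithmetic, with the numbers copied from the rows above and instance-hours assuming a 720-hour month:

```javascript
// Recomputing the monthly total from the table's own inputs
const hours = 720; // 30-day month
const monthly = Math.round(
  150 * 0.08 * hours +  // API servers
  20 * 0.25 * hours +   // Kafka brokers
  1000 * 0.04 * hours + // consumer workers
  5 * 0.13 * hours +    // Redis nodes
  50 * 0.6 * hours +    // database instances
  500                   // load balancers (flat estimate)
);

console.log(monthly);      // 63608
console.log(monthly * 12); // 763296, i.e. ~$763K/year
```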
How to actually present this in an interview
The first 60–90 seconds should be a high-level summary, then let the interviewer steer. Don't try to say everything — that's another common mistake.
Something like:
"For 1M req/sec, I'd go with a multi-layered setup. Load balancer in front distributing to ~150 API servers with auto-scaling at 70% CPU. Servers publish events to Kafka — 100 partitions, replication factor 3. Around 1,000 workers consume from Kafka. Redis in front of the database targeting 80% hit rate, which gets effective database load down to ~200K req/sec.
On the database side, if read-heavy, one primary and ~15 read replicas. If write-heavy, we'd need sharding — probably 50 shards. Total cost on AWS is roughly $64K/month on-demand, maybe $25–35K with reserved and spot pricing.
Operationally: rate limiting at the gateway, idempotency keys on critical operations, DLQs with exponential backoff, Prometheus and Grafana for observability.
What would you like to dig into?"
Then stop talking. The follow-up questions are where you actually demonstrate depth.
Common ones you should be ready for:
- "What if Kafka goes down?" → Replication factor 3 means you survive 2 broker failures. Leader election handles failover automatically. Clients should retry with backoff.
- "How do you handle celebrity users?" → Pre-warm cache for known high-traffic accounts. Separate priority queues. Pre-computed aggregates for read-heavy celebrity data.
- "Database hotspots?" → Monitor per-shard load in Grafana. Consistent hashing helps distribute evenly. Celebrity shards get extra replicas.
- "Zero-downtime deploys?" → Blue-green for API servers (stateless, easy). Canary releases for Kafka consumer changes (consumers are stateful, more careful rollout needed).
What I actually learned writing this
I went into this thinking I understood scaling reasonably well. The exercise of doing the actual math on everything — workers, shards, Redis nodes, replicas — was more humbling than I expected.
A few things that shifted:
The worker count math is what gets people. It's easy to say "spin up more workers" until you realize that 100ms average processing time means you need 100,000 of them. Suddenly optimizing your hot path from 100ms to 1ms isn't a nice-to-have — it's a 100x reduction in infrastructure cost.
Caching isn't an optimization at this scale. It's structural. Without it, the database numbers simply don't work, regardless of how well you've sharded.
And the cost section genuinely surprised me. I knew it would be expensive. I didn't have a specific number in my head. $64K/month is a real thing to say in an interview — it shows you're thinking like someone who's going to own this in production, not just design it on a whiteboard and hand it off.
Further reading
- Alex Xu — System Design Interview: An Insider's Guide — the most practical starting point I've found
- Martin Kleppmann — Designing Data-Intensive Applications — if you want to go deep on the database and streaming theory
- Kafka documentation on partitions and consumer groups — worth reading past the quickstart
- Redis Cluster spec — surprisingly readable, good for understanding sharding internals
What would you add to this? If you've actually operated systems at this scale, I'm curious what I'm missing or oversimplifying — drop a comment or reach out on LinkedIn.