Thundering Herd Problem

If you are a developer, or even just know one, you’ve probably tried booking a Tatkal ticket at least once. Everything feels normal until the exact moment bookings open. Suddenly the website starts lagging, payments fail, the UI breaks in places, and within a minute or two the tickets are gone. A system designed to handle millions of users somehow turns into what feels like a lucky draw.

This doesn’t happen because the system is poorly built. It happens because millions of users perform the exact same action at the exact same second. When Tatkal opens, every user clicks “Search” or “Book” simultaneously. This creates a massive spike of requests hitting the same servers, databases, and payment systems at once. Even highly scalable systems have limits on how many operations they can process per second. When those limits are exceeded suddenly, the system slows down, fails requests, or temporarily becomes unstable.

This phenomenon is known as the Thundering Herd Problem: a large number of clients request the same resource simultaneously, overwhelming backend systems.

In this blog, we’ll understand why this happens internally and how engineers design systems to handle it using techniques like caching, request coalescing, rate limiting, and exponential backoff.

How Systems Handle the Thundering Herd Problem

Once you understand why Tatkal booking systems slow down, the next question is obvious: how do engineers prevent it?

Modern distributed systems use a combination of techniques to ensure that millions of users don’t overwhelm the backend at the same time. Each technique solves a different part of the problem.

1. Request Coalescing: Let one request do the work

When thousands of users request the same resource (for example, seat availability for the same train), there’s no need for the server to process the same query thousands of times.

Request coalescing ensures that only one request reaches the backend, while the others wait for its result.

The server processes the request once and returns the same response to everyone waiting.

This significantly reduces backend load and prevents crashes during traffic spikes.

This technique is widely used in CDNs, caching layers, and high-traffic APIs.
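
To make the idea concrete, here is a minimal in-process sketch of request coalescing using threads. The names SingleFlight, simulate_spike, and query_seats, the key "train:12345", and the returned seat data are all hypothetical and chosen only for illustration. A real system would usually coalesce at a shared layer such as a CDN or cache proxy rather than inside one process.

```python
import threading
import time


class _Call:
    """State shared by every caller waiting on one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, fetch):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                # First caller for this key becomes the leader.
                call = _Call()
                self._in_flight[key] = call

        if leader:
            try:
                call.result = fetch()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()  # wake every waiting follower
        else:
            call.done.wait()  # followers wait for the leader's result

        if call.error is not None:
            raise call.error
        return call.result


def simulate_spike(n_clients=20):
    """Hypothetical demo: n_clients ask for the same train at once."""
    flight = SingleFlight()
    backend_calls = []
    results = []

    def query_seats():
        backend_calls.append(1)  # list.append is atomic in CPython
        time.sleep(0.3)          # stand-in for a slow database query
        return {"train": "12345", "seats_left": 42}

    def client():
        results.append(flight.do("train:12345", query_seats))

    threads = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(backend_calls), results
```

Calling simulate_spike(20) starts twenty clients at nearly the same moment. As long as they all arrive while the first query is still running, the backend function runs once and every client receives the same result.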

2. Cache Locking (Mutex): Prevent multiple cache rebuilds

A cache is often used to store frequently accessed data. But when a cache entry expires, many incoming requests may try to rebuild it at the same time.

Without protection, this leads to:

Duplicate database queries
Sudden load spikes
Backend overload

A mutex (mutual exclusion lock) ensures that only one request rebuilds the cache entry at a time; the others wait for it to finish and then read the fresh value.

This prevents unnecessary database calls and keeps the system stable.
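
A rough sketch of cache locking within a single process might look like the following. It uses one lock per key and re-checks the cache after acquiring the lock, so threads that waited do not rebuild the entry again. The LockedCache class and its loader and ttl_seconds parameters are assumptions made for this example. In a multi-server deployment the lock would have to be a distributed one, which this sketch does not cover.

```python
import threading
import time


class LockedCache:
    """In-memory cache where only one thread rebuilds an expired key."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader      # function: key -> value (e.g. a DB query)
        self._ttl = ttl_seconds
        self._entries = {}         # key -> (value, expires_at)
        self._key_locks = {}       # key -> threading.Lock
        self._guard = threading.Lock()

    def _lock_for(self, key):
        # Create the per-key lock lazily; the guard makes this race-free.
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]        # fast path: cache hit, no locking

        with self._lock_for(key):
            # Re-check: another thread may have rebuilt it while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._entries[key] = (value, time.monotonic() + self._ttl)
            return value
```

The second check inside the lock is what keeps the waiting threads from repeating the database call once the first thread has finished.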

3. Staggered Expiry (Jitter): Avoid synchronized cache expiration

One of the most common causes of a thundering herd is thousands of cache entries expiring at the exact same time.

Instead of using a fixed expiration time, systems add a small random variation (jitter).

Now cache entries expire gradually instead of all at once.

This spreads rebuild traffic more evenly over time and prevents sudden spikes.

This technique is commonly applied when setting expiry times in caches such as Redis, and at large-scale web platforms such as Netflix.
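
As a small illustration, jitter can be added when the expiry time is computed. The function name ttl_with_jitter, the 10% default spread, and the 300-second base TTL in the usage note are assumptions for this sketch, not values from any particular system.

```python
import random


def ttl_with_jitter(base_ttl, jitter_fraction=0.1, rng=random):
    """Return base_ttl shifted by a random amount within +/- jitter_fraction.

    rng can be the random module or a random.Random instance
    (passing a seeded instance makes the output reproducible).
    """
    spread = base_ttl * jitter_fraction
    return base_ttl + rng.uniform(-spread, spread)
```

With a base TTL of 300 seconds and the default fraction, each entry would expire somewhere between 270 and 330 seconds after it is written, so entries created together no longer expire together.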

4. Exponential Backoff (Control retry storms)

When a request fails, clients usually retry. But if thousands of clients retry immediately, it makes the problem worse.

Exponential backoff increases the delay between retries.

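Example retry pattern (illustrative delays): wait 1 second before the first retry, then 2, 4, and 8 seconds before the following ones.

Often combined with randomness (jitter): each client waits a random time up to the current delay, so clients that failed together do not all retry together.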

This gives the system time to recover and prevents retry storms.

This is commonly used in payment systems, cloud APIs, and distributed queues.
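
The retry logic described in this section can be sketched roughly as below. The function names, the 1-second base delay, the 30-second cap, and the choice of picking a random wait between zero and the current delay are all assumptions made for illustration. Real clients tune these values for their own workloads.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=30.0, attempts=5):
    """Upper bound on the wait before each retry: base, base*factor, ..."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def retry_with_backoff(operation, attempts=5, base=1.0, cap=30.0,
                       sleep=time.sleep, rng=random):
    """Call operation(); on failure, wait a jittered, growing delay and retry.

    sleep and rng are injectable so the function can be tested
    without actually waiting.
    """
    delays = backoff_delays(base=base, cap=cap, attempts=attempts)
    for attempt, ceiling in enumerate(delays):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Random wait in [0, ceiling] so clients do not retry in lockstep.
            sleep(rng.uniform(0, ceiling))
```

Here backoff_delays() gives the growing ceilings (1, 2, 4, 8, 16 seconds with the defaults), and retry_with_backoff picks a random wait below each ceiling. Catching every Exception keeps the sketch short; a real client would retry only errors that are known to be transient.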

5. Rate Limiting: Control how many requests are allowed

Rate limiting restricts how many requests a user or system can make within a specific time period.

Example (illustrative numbers): each user may make at most 100 requests per minute.

If requests exceed the limit:

they may be delayed,
queued,
or rejected temporarily.

This protects backend services from overload and ensures fair usage.

Common rate limiting algorithms include:

Fixed Window
Sliding Window
Token Bucket
Leaky Bucket

Most public APIs and authentication systems use rate limiting.
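
Of the algorithms listed above, the token bucket is perhaps the easiest to sketch. The version below is a minimal, single-threaded illustration. The TokenBucket class, its rate and capacity parameters, and the injectable clock are assumptions for this example, and a production limiter would also need locking or shared storage across servers.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so tests can fake time
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per user or API key and, when allow() returns False, delay, queue, or reject the request as described above.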

How these techniques work together

Real systems don’t rely on just one solution. They combine multiple layers of protection:

Cache reduces database load
Request coalescing prevents duplicate work
Mutex prevents simultaneous cache rebuilds
Jitter prevents synchronized expiry
Backoff prevents retry storms
Rate limiting prevents overload

Together, these techniques ensure systems remain stable even when millions of users act at the same moment.