I Built an Event Ingestion System - Then It Broke Under Load

A step toward understanding systems under real load

Exploring systems, scalability, and real-world bottlenecks by breaking them

🧠 How I started ?

I've always been fascinated by how systems scale under heavy load.
That curiosity pushed me to explore it more deeply, so I started building this project.

Through this, I aim to understand how backend systems behave, from handling hundreds of concurrent users to (hopefully) millions.

🛠️ What I'm building

I'm building a platform that enables developers to collect, process, and analyze events from their applications to gain deep visibility into user behavior and system performance.

It acts as a lens into application activity, transforming raw, scattered events into structured insights that help developers understand what is happening inside their systems, both in real time and over time.

⚙️ How I started building it

My approach was to start with a simple working POC.

So I spun up Cursor, initialized a Node.js + TypeScript project, plugged in the helper tools I like to have in my projects (husky, prettier), and started with two APIs:

  • create project - accepts a project name, creates an API key for the user, and saves it to the DB

  • create event - accepts an event payload and stores the raw data in the DB

For the database, I first thought of using NoSQL for raw events plus SQL for structured and aggregated data, but to keep things simple initially, I went with Postgres alone.

So far, my initial flow was: Client -> API -> DB
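
To make that flow concrete, here is a minimal sketch of the two endpoints as plain functions. The post does not show its schema or handlers, so the `Project` and `RawEvent` shapes, the lookup by API key, and the `generateApiKey` helper are all assumptions for illustration, and in-memory arrays stand in for the Postgres tables so the snippet runs on its own.

```typescript
// Hypothetical shapes; the real schema is not shown in the post.
interface Project {
  id: number;
  name: string;
  apiKey: string;
}

interface RawEvent {
  projectId: number;
  payload: unknown;
  receivedAt: number;
}

// In-memory arrays stand in for the Postgres tables.
const projects: Project[] = [];
const events: RawEvent[] = [];

// Placeholder key generator. Math.random is NOT cryptographically secure;
// a real service would use something like crypto.randomBytes from node:crypto.
function generateApiKey(): string {
  let key = "";
  for (let i = 0; i < 32; i++) {
    key += Math.floor(Math.random() * 16).toString(16);
  }
  return key;
}

// create project: accepts a name, creates an API key, saves the row.
function createProject(projectName: string): Project {
  const project: Project = {
    id: projects.length + 1,
    name: projectName,
    apiKey: generateApiKey(),
  };
  projects.push(project);
  return project;
}

// create event: finds the project by API key (an assumption) and stores the
// raw payload. In the real flow this is where the request waits on the DB.
function createEvent(apiKey: string, payload: unknown): boolean {
  const project = projects.find((p) => p.apiKey === apiKey);
  if (!project) return false; // unknown key: reject the event
  events.push({ projectId: project.id, payload, receivedAt: Date.now() });
  return true;
}
```

The point to notice is that `createEvent` only returns once the write has happened, which is exactly the Client -> API -> DB coupling that the load test exposes later.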

🚀 First load test

After building the APIs, I wanted to test the project's capacity at this stage, so I read up on load testing and found autocannon, a load testing tool.

So I fired up my server, healthier than ever, and triggered my first test: 100 concurrent requests for 10s.

Well, this little thing handled it nicely.

So I increased the number of concurrent requests to 200 -> 300 -> 500 for the same 10s duration, and I could see latency going crazy high. And yes, this was a burst of requests, not a steady load.

Then came the big one: I tested with 1000 concurrent requests, and BOOM, latency increased again while throughput stayed constant at ~3700 req/sec.

That means:

  • latency increased with the number of concurrent requests (as expected)

  • requests/second stayed constant (this is the interesting part)

💥 Hit the first bottleneck

Looking at the numbers, I concluded that the latency increase was expected behavior. But rising latency with flat throughput meant Node.js was accepting requests and those requests were then waiting on something. So what was capping processing at ~3700 req/sec?

That's where the DB comes into the picture: the database had a throughput limit of ~3-4k writes/sec.

And those writes were synchronous, which means each request had to wait for its DB write to finish before the DB could be handed another request's data.

To sum up:

  • latency increased

  • DB is capped at a throughput of ~3-4k writes/sec

  • DB is slower than the API

  • requests pile up
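
The "constant throughput, rising latency" pattern is what Little's Law predicts once a system is saturated: requests in flight = throughput x latency. Here is a rough sketch of that arithmetic. It assumes the load generator keeps a fixed number of requests in flight (one per connection) and that the server was already at its ~3700 req/sec ceiling at every step, which the numbers above suggest but do not prove. The printed values are estimates, not my measurements.

```typescript
// Little's Law rearranged: latency = requests in flight / throughput.
function expectedLatencyMs(concurrency: number, throughputPerSec: number): number {
  return (concurrency / throughputPerSec) * 1000;
}

const throughputCeiling = 3700; // req/sec, the plateau seen in the load test

for (const connections of [100, 200, 300, 500, 1000]) {
  const latency = expectedLatencyMs(connections, throughputCeiling);
  console.log(`${connections} connections -> ~${latency.toFixed(0)} ms average latency`);
}
```

Under those assumptions, going from 100 to 1000 connections multiplies average latency by ten (roughly 27 ms to 270 ms) without adding a single extra request per second, because the extra requests just spend longer waiting for the DB.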

🔄 Moving to async processing

Instead of making requests wait for the DB write to complete, the request data can be stored somewhere temporarily so the request can finish right away.

To achieve this, I created an in-memory queue and stored the request data, i.e., the event payloads, to process later.

To consume that queue, I needed a worker that takes those event payloads and writes them to the database.

This improved latency with the same DB throughput.
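
A minimal sketch of that split, assuming a plain array as the queue and a fake `writeToDb` with a 1 ms delay in place of the real Postgres insert. The function names are hypothetical; the post does not show the actual code.

```typescript
type EventPayload = Record<string, unknown>;

// Unbounded in-memory queue: nothing here stops it from growing.
const queue: EventPayload[] = [];

// Stand-in for the Postgres insert; the timer simulates a slow write.
const written: EventPayload[] = [];
function writeToDb(payload: EventPayload): Promise<void> {
  return new Promise((resolve) => {
    setTimeout(() => {
      written.push(payload);
      resolve();
    }, 1);
  });
}

// API side: enqueue and return immediately instead of awaiting the DB.
function handleCreateEvent(payload: EventPayload): { accepted: boolean } {
  queue.push(payload);
  return { accepted: true };
}

// Worker side: drain the queue one event at a time.
async function runWorkerOnce(): Promise<number> {
  let processed = 0;
  while (queue.length > 0) {
    const next = queue.shift()!;
    await writeToDb(next);
    processed++;
  }
  return processed;
}
```

The request now costs one array push, so latency drops, but the worker still writes at the same DB speed. Anything the API accepts faster than that simply sits in `queue`.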

😅 It didn't go as planned

Now I had other issues:

  • timeouts started

  • the system became unstable as the queue grew

  • backpressure (events arriving in the queue faster than the worker could consume them)

📉 Understanding backpressure

Backpressure shows up when more data is coming in than going out.

  • producer > consumer

  • API faster than worker

  • queue growing indefinitely
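
A tiny simulation shows how quickly this gets out of hand. The arrival rate of 10000 events/sec below is a made-up number for illustration; only the ~3700/sec drain rate comes from the load test.

```typescript
// Queue depth at the end of each second, given arrival and drain rates
// in events per second. Depth never goes below zero.
function queueDepthOverTime(arrivalRate: number, drainRate: number, seconds: number): number[] {
  const depths: number[] = [];
  let depth = 0;
  for (let s = 0; s < seconds; s++) {
    depth = Math.max(0, depth + arrivalRate - drainRate);
    depths.push(depth);
  }
  return depths;
}

// Producer faster than consumer: the backlog grows by 6300 every second.
console.log(queueDepthOverTime(10000, 3700, 5));
// -> [ 6300, 12600, 18900, 25200, 31500 ]

// Consumer keeps up: the queue stays empty.
console.log(queueDepthOverTime(3000, 3700, 5));
// -> [ 0, 0, 0, 0, 0 ]
```

As long as the arrival rate stays above the drain rate, the backlog grows linearly and never recovers on its own, so memory use and the age of the oldest queued event both keep climbing.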

🧩 Fixing the system

  1. Bounded Queue

    • Because the queue grew without limits, at some point the system ran out of memory and crashed.

    • So I needed to put a limit on the queue; events arriving above that limit are dropped.

    • That comes with a trade-off: either let the system crash when memory runs out, or drop some events to keep the server running.

  2. Controlled worker throughput

    • Initially, I was only focused on making the API faster, but I realized that the worker also needed to be controlled.

    • Instead of processing all events as quickly as possible, I introduced a more controlled approach:

      • Events were processed in batches instead of individually (100 events at once)

      • Each batch was written to the database sequentially

    • This ensured that the worker didn't overwhelm the database with too many concurrent writes.

    • Interestingly, this also acted as a natural rate limiter: the worker could only process events as fast as the database allowed.

  3. Retry Mechanism

    • Failures happen: an event insertion might fail. To handle this, I created a retry queue that reprocesses failed insertions up to 3 times; after that, the event data is discarded.

    • This limit avoids an infinite loop of retrying failed event insertions.
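
The three fixes above can be sketched together in one small module. This is an illustration under assumptions, not the project's actual code: the queue limit of 10000 is made up (the post gives no number), failures are handled per batch rather than per event, retries are served before new events, and `insertBatch` is a hypothetical function supplied by the caller that performs the DB write.

```typescript
type EventPayload = Record<string, unknown>;

interface QueuedEvent {
  payload: EventPayload;
  retries: number;
}

const MAX_QUEUE_SIZE = 10000; // hypothetical limit, not from the post
const BATCH_SIZE = 100; // batch size mentioned in the post
const MAX_RETRIES = 3; // retry limit mentioned in the post

const mainQueue: QueuedEvent[] = [];
const retryQueue: QueuedEvent[] = [];
let dropped = 0;

// 1. Bounded queue: drop new events once the limit is reached.
function enqueue(payload: EventPayload): boolean {
  if (mainQueue.length >= MAX_QUEUE_SIZE) {
    dropped++;
    return false;
  }
  mainQueue.push({ payload, retries: 0 });
  return true;
}

// 2. Batching: take up to BATCH_SIZE events, retries first.
function takeBatch(): QueuedEvent[] {
  const batch = retryQueue.splice(0, BATCH_SIZE);
  return batch.concat(mainQueue.splice(0, BATCH_SIZE - batch.length));
}

// 3. Retry: requeue failed events until they have been retried MAX_RETRIES times.
function handleFailedBatch(batch: QueuedEvent[]): void {
  for (const item of batch) {
    if (item.retries < MAX_RETRIES) {
      item.retries++;
      retryQueue.push(item);
    } else {
      dropped++; // discarded after the last retry failed
    }
  }
}

// Worker loop: one batch at a time, awaited sequentially, so the worker
// never has more than one write in flight against the database.
async function drain(insertBatch: (rows: EventPayload[]) => Promise<void>): Promise<void> {
  while (mainQueue.length > 0 || retryQueue.length > 0) {
    const batch = takeBatch();
    try {
      await insertBatch(batch.map((item) => item.payload));
    } catch {
      handleFailedBatch(batch);
    }
  }
}
```

Because `drain` awaits each batch before taking the next one, the database sets the pace, which is the "natural rate limiter" effect described above. Even if `insertBatch` fails every time, the loop still ends, since each event is dropped after its third retry.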

πŸ—οΈ Final Architecture

💡 Key Learnings

  • Throughput ≠ Concurrency

  • Async systems need backpressure

  • Batching improves DB performance

  • You don't remove bottlenecks, you move them

⚖️ Tradeoffs I Had to Accept

  • Dropping events VS Crashing system

  • Latency VS Reliability

  • Simplicity VS Robustness

🚀 What's Next

This project helped me understand how systems behave under load, but it also showed me how much more there is to explore.

Some things I'm planning to work on next:

  • Moving from in-memory queues to a durable system like Redis or Kafka

  • Introducing a Dead Letter Queue (DLQ) for failed events

  • Improving scalability by running multiple workers

  • Exploring how to distribute load across multiple instances

The goal is to keep evolving this system and learn how real-world backend systems are designed and scaled.

From Simple APIs to Scalable Systems

Part 1 of 5

A hands-on journey of building backend systems and understanding how they behave under real load. From simple APIs to scalable architectures, this series focuses on learning by building, breaking, and improving systems.

Up next

Queues, Replicas & Bottlenecks: Scaling My Event Ingestion Pipeline

Exploring queues, batching, load balancing, and system bottlenecks while scaling an event ingestion architecture.