I Built an Event Ingestion System — Then It Broke Under Load

🧠 How I started

I’ve always been fascinated by how systems scale under heavy load.
That curiosity pushed me to explore it more deeply, so I started building this project.

Through this, I aim to understand how backend systems behave — from handling hundreds of concurrent users to (hopefully) millions.

🛠️ What I'm building

I'm building a platform that enables developers to collect, process, and analyze events from their applications to gain deep visibility into user behavior and system performance.

It acts as a lens into application activity, transforming raw, scattered events into structured insights that help developers understand what is happening inside their systems in real-time and over time.

⚙️ How I started building it

My approach was to start with a simple working POC.

So I spun up Cursor, initialized a Node.js + TypeScript project, plugged in a few helper tools I like to have in my projects (husky, prettier), and started with two APIs:

create project - accepts a project name, creates an API key for the user, and saves it to the DB
create event - accepts an event payload and stores the raw data in the DB

For the database, I first thought of using NoSQL for raw events plus SQL for structured and aggregated data, but to keep things simple initially, I went with Postgres.

So far, my initial flow was: Client -> API -> DB

🚀 First load test

After building the APIs, I wanted to see how much load the project could take at this stage, so I read up on load testing and found autocannon, a load testing tool.

So I fired up my server, up and running and healthier than ever, and triggered my first test with 100 concurrent requests for 10s.

Well, this little thing handled it nicely.

Then I increased the request count to 200 -> 300 -> 500 for the same 10s duration and watched latency climb sharply. And yes, this was a burst of requests, not a steady load.

Then came the big one: I tested with 1000 concurrent requests and BOOM, latency increased again while throughput stayed flat at ~3700 req/sec.

That means:

latency increased with the number of concurrent requests (as expected)
requests/second stayed constant (this is the interesting part)

💥 Hit first bottleneck

Looking at the numbers, the latency increase was expected. But higher latency with flat throughput means Node.js is accepting requests that then have to wait for something. So what was capping processing at ~3700 req/sec?

This is where the DB comes into the picture: the database had a throughput limit of ~3-4k writes/sec.

And those writes were synchronous from the request's point of view, which means each request had to wait for its DB write to finish before it could complete, so further requests queued up behind the writes.

Concluding this:

latency increased
DB is capped at a throughput of ~3-4k writes/sec
DB is slower than API
Requests pile up
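
One way to see why latency grows while throughput stays flat is Little's Law (concurrency = throughput × latency). The sketch below is illustrative arithmetic, not measured data: `expectedLatencyMs` is a hypothetical helper, and it assumes a steady state with throughput pinned at the ~3700 req/sec plateau from the test.

```typescript
// Little's Law: concurrency = throughput × latency.
// With throughput capped, average latency grows linearly with concurrency.
function expectedLatencyMs(concurrentRequests: number, throughputPerSec: number): number {
  if (throughputPerSec <= 0) {
    throw new RangeError("throughputPerSec must be positive");
  }
  return (concurrentRequests / throughputPerSec) * 1000;
}

// Plateau observed in the load test (req/sec).
const cappedThroughput = 3700;

for (const concurrency of [100, 200, 300, 500, 1000]) {
  const ms = expectedLatencyMs(concurrency, cappedThroughput);
  console.log(`${concurrency} concurrent -> ~${ms.toFixed(0)} ms average latency`);
}
```

Under that assumption, 100 concurrent requests work out to roughly 27 ms on average and 1000 to roughly 270 ms. That is consistent with the pattern in the load test: throughput stays flat, and latency scales with concurrency.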

🔄 Moving to async processing

Instead of making each request wait for its DB write to complete, the request data can be stored somewhere temporarily so the request can finish right away.

To achieve this, I created an in-memory queue and stored the request data, i.e., the event payloads, to process later.

To consume that queue, I needed a worker that takes those event payloads and writes them to the database.

This improved latency while DB throughput stayed the same.
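
A minimal sketch of that idea, not the project's actual code: `acceptEvent`, `runWorker`, and the `WriteFn` type are hypothetical names, and the real Postgres insert is represented by whatever function is passed in as `write`.

```typescript
type EventPayload = Record<string, unknown>;

// Hypothetical stand-in for the real Postgres insert.
type WriteFn = (event: EventPayload) => Promise<void>;

// Unbounded in-memory buffer between the API and the database.
const queue: EventPayload[] = [];

// Called from the request handler: no database round trip, so it returns immediately.
function acceptEvent(payload: EventPayload): number {
  queue.push(payload);
  return queue.length; // current backlog, handy for logging
}

// Single worker: takes one payload at a time and writes it to the database.
// Error handling is left out here; a failed write would stop this loop.
async function runWorker(
  write: WriteFn,
  shouldStop: () => boolean,
  idleMs = 10,
): Promise<void> {
  while (!shouldStop()) {
    const event = queue.shift();
    if (event === undefined) {
      // Queue is empty: sleep briefly instead of spinning.
      await new Promise<void>((resolve) => setTimeout(resolve, idleMs));
      continue;
    }
    await write(event);
  }
}
```

The request path now only pays for an array push, which is why latency drops. Nothing in this version limits how large `queue` can get, and the database still writes at the same rate as before.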

😅 It didn't go as planned

Now I had other issues:

timeouts started
the system became unstable as the queue grew
backpressure (data entering the queue > data consumed by the worker)

📉 Understanding backpressure

Backpressure builds up when more data comes in than goes out.

producer > consumer
API faster than worker
queue growing indefinitely
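
To put rough numbers on this, here is a small sketch. The helper name `queueDepthAfter` and the producer rate of 10,000 events/sec are assumptions for illustration only; the 3,700/sec consumer rate is borrowed from the throughput plateau seen earlier.

```typescript
// Backlog after `seconds`, starting from an empty queue.
// If the consumer keeps up, the backlog stays at zero.
function queueDepthAfter(
  seconds: number,
  producedPerSec: number,
  consumedPerSec: number,
): number {
  return Math.max(0, (producedPerSec - consumedPerSec) * seconds);
}

// Hypothetical rates: the API accepts 10,000 events/sec, the worker writes 3,700/sec.
const backlog = queueDepthAfter(10, 10_000, 3_700);
console.log(`Backlog after 10s: ${backlog} events`);
```

With those assumed rates the queue gains 6,300 events every second, so a 10s burst leaves 63,000 events waiting in memory. The gap between the two rates, not the absolute speed of either side, decides how fast the queue grows.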

🧩 Fixing the system

Bounded Queue
- With the queue growing without limit, the system eventually ran out of memory and crashed.
- So I needed to put a limit on the queue; events beyond that limit are dropped.
- That comes with a trade-off: either let the system crash from memory exhaustion, or drop some events and keep the server running.
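
A bounded queue along these lines could look like the sketch below. The class name `BoundedQueue` and its methods are hypothetical, and the capacity is whatever fits the available memory; the source does not state the actual limit.

```typescript
// Fixed-capacity FIFO queue that rejects new items when full.
class BoundedQueue<T> {
  private readonly capacity: number;
  private readonly items: T[] = [];
  private dropped = 0;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Returns false (and counts a drop) when the queue is already full.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Removes and returns up to `max` items from the front.
  takeBatch(max: number): T[] {
    return this.items.splice(0, max);
  }

  get size(): number {
    return this.items.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```

Returning `false` from `offer` lets the API decide how to signal the drop to the client, and `droppedCount` makes the cost of the trade-off visible instead of silent.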
Controlled worker throughput
- Initially, I was only focused on making the API faster, but I realized that the worker also needed to be controlled.
- Instead of processing all events as quickly as possible, I introduced a more controlled approach:
  - Events were processed in batches instead of individually (100 events at once)
  - Each batch was written to the database sequentially
- This ensured that the worker didn’t overwhelm the database with too many concurrent writes.
- Interestingly, this also acted as a natural rate limiter — the worker could only process events as fast as the database allowed.
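
As a sketch of the batching loop described above, assuming a hypothetical `writeBatch` function that inserts one batch into Postgres (how the real insert is done is not shown in this post):

```typescript
type EventPayload = Record<string, unknown>;

// Hypothetical stand-in for whatever actually inserts a batch into Postgres.
type BatchWriter = (batch: EventPayload[]) => Promise<void>;

const BATCH_SIZE = 100; // batch size mentioned above

// Drains `pending` in batches, writing one batch at a time.
// Returns the number of events written.
async function drainInBatches(
  pending: EventPayload[],
  writeBatch: BatchWriter,
  batchSize: number = BATCH_SIZE,
): Promise<number> {
  let written = 0;
  while (pending.length > 0) {
    // Removes up to `batchSize` events from the front of the array.
    const batch = pending.splice(0, batchSize);
    // The next batch starts only after this write resolves.
    await writeBatch(batch);
    written += batch.length;
  }
  return written;
}
```

Because each `await` finishes before the next batch is taken, there is never more than one write in flight, so the loop runs exactly as fast as the database lets it.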
Retry Mechanism
- Failures happen: an event insertion might fail. To handle this, I created a retry queue that processes failed insertions again, up to 3 times; after that, the event data is discarded.
- This limit avoids infinite loops of retrying failed event insertions.
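
The retry limit can be sketched as follows. The names `RetryItem`, `handleFailure`, and `processRetryQueue` are hypothetical, `insert` stands in for the real DB write, and details such as delays between retries are left out because the post does not describe them.

```typescript
const MAX_RETRIES = 3;

interface RetryItem<T> {
  event: T;
  retries: number; // how many retries have already been scheduled for this event
}

// Decides what happens after a failed insertion.
// Returns true if the event was re-queued, false if it was discarded.
function handleFailure<T>(item: RetryItem<T>, retryQueue: RetryItem<T>[]): boolean {
  if (item.retries >= MAX_RETRIES) {
    return false; // retry budget used up: give up on this event
  }
  retryQueue.push({ event: item.event, retries: item.retries + 1 });
  return true;
}

// One pass over the retry queue.
// Returns the events that were discarded during this pass.
async function processRetryQueue<T>(
  retryQueue: RetryItem<T>[],
  insert: (event: T) => Promise<void>,
): Promise<T[]> {
  const discarded: T[] = [];
  const snapshot = retryQueue.splice(0); // take everything currently queued
  for (const item of snapshot) {
    try {
      await insert(item.event);
    } catch {
      if (!handleFailure(item, retryQueue)) {
        discarded.push(item.event);
      }
    }
  }
  return discarded;
}
```

In this sketch the main worker would call `handleFailure({ event, retries: 0 }, retryQueue)` when the first insert fails. The counter travels with the event, so an event that keeps failing is retried three times and then discarded rather than cycling forever.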

🏗️ Final Architecture

Putting the pieces above together, the flow became: Client -> API -> bounded in-memory queue -> batching worker -> DB, with a retry queue (up to 3 retries) for failed insertions.

💡 Key Learnings

Throughput ≠ Concurrency
Async systems need backpressure
Batching improves DB performance
You don't remove bottlenecks, you move them

⚖️ Tradeoffs I Had to Accept

Dropping events VS Crashing system
Latency VS Reliability
Simplicity VS Robustness

🚀 What’s Next

This project helped me understand how systems behave under load, but it also showed me how much more there is to explore.

Some things I'm planning to work on next:

Moving from in-memory queues to a durable system like Redis or Kafka
Introducing a Dead Letter Queue (DLQ) for failed events
Improving scalability by running multiple workers
Exploring how to distribute load across multiple instances

The goal is to keep evolving this system and learn how real-world backend systems are designed and scaled.
