I Built an Event Ingestion System, Then It Broke Under Load
A step toward understanding systems under real load

How I started
I've always been fascinated by how systems scale under heavy load.
That curiosity pushed me to explore it more deeply, so I started building this project.
Through it, I aim to understand how backend systems behave, from handling hundreds of concurrent users to (hopefully) millions.
What I'm building
I'm building a platform that enables developers to collect, process, and analyze events from their applications to gain deep visibility into user behavior and system performance.
It acts as a lens into application activity, transforming raw, scattered events into structured insights that help developers understand what is happening inside their systems, both in real time and over time.
How I started building it
My approach was to start with a simple working POC.
So I spun up Cursor, initialized a Node.js + TypeScript project, plugged in the helper tools I like to have in my projects (husky, prettier), and started with two APIs:
create project: accepts a project name, creates an API key for the user, and saves it to the db
create event: accepts an event payload and stores the raw data in the db
For the database, I first thought of using NoSQL for raw events plus SQL for structured and aggregated data, but to keep things simple initially, I went with Postgres.
So far, my initial flow was: Client -> API -> DB
First load test
After building the APIs, I wanted to test the project's capacity at this stage, so I read up on load testing and found autocannon, a load testing tool.
So I fired up my server, healthier than ever, and triggered my first test: 100 concurrent requests for 10s.
Well, this little thing handled it nicely.
So I increased the concurrency to 200 -> 300 -> 500 for the same 10s duration and watched latency climb sharply. And yes, this was a burst of requests, not a steady load.
Then came the bomber: I tested with 1000 concurrent requests, and BOOM, latency increased again while throughput stayed constant at ~3700 req/sec.
That means:
latency increased with the number of concurrent requests (as expected)
requests/second stayed constant (this is the interesting part)
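One way to read these two observations together is Little's Law: once throughput has hit its ceiling, average latency is roughly the number of in-flight requests divided by throughput. The sketch below is a back-of-the-envelope estimate, not measured data. `estimateLatencyMs` is a hypothetical helper, and it assumes throughput is already at the ~3700 req/sec plateau for every concurrency level.

```typescript
// Little's Law: L = lambda * W, so W = L / lambda.
// L = requests in flight (concurrency), lambda = throughput (req/sec), W = average latency.
function estimateLatencyMs(concurrency: number, throughputPerSec: number): number {
  if (throughputPerSec <= 0) {
    throw new Error("throughput must be positive");
  }
  return (concurrency / throughputPerSec) * 1000;
}

const plateau = 3700; // req/sec, the ceiling seen in the load test

for (const concurrency of [100, 200, 300, 500, 1000]) {
  const ms = estimateLatencyMs(concurrency, plateau);
  console.log(`${concurrency} concurrent -> ~${ms.toFixed(0)} ms average latency`);
}
```

At 1000 concurrent requests this works out to roughly 270 ms. With throughput fixed, every extra in-flight request only adds waiting time.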
Hit first bottleneck
Looking at the numbers, the latency increase was expected. But higher latency with flat throughput means Node.js is accepting requests and those requests are waiting on something. So what was capping processing at ~3700 req/sec?
This is where the db comes into the picture: the database had a throughput limit of ~3-4k writes/sec.
And those were sync writes, which means each request had to wait for its db write to finish before responding, and the db had to finish that write before taking on another request's data.
Concluding this:
latency increased
DB is capped at a throughput of ~3-4k writes/sec
DB is slower than API
Requests pile up
Moving to async processing
Instead of making requests wait for the db write to complete, the request data can be stored somewhere temporarily so the request can finish right away.
To achieve this, I created an in-memory queue that holds the request data, i.e., the event payloads, to process later.
To consume that queue, I created a worker that takes those event payloads and writes them to the database.
This improved latency with the same db throughput.
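A minimal sketch of that queue-plus-worker idea, under some assumptions: `EventPayload`, `enqueue`, and `drain` are illustrative names, and `saveEvent` stands in for the real Postgres insert. It is not the project's actual code.

```typescript
type EventPayload = Record<string, unknown>;

// In-memory buffer between the API and the database (unbounded at this stage).
const queue: EventPayload[] = [];

// Called by the API handler: store the payload and return immediately,
// so the HTTP response no longer waits for the database write.
function enqueue(event: EventPayload): void {
  queue.push(event);
}

// Worker: drains the queue one event at a time.
// `saveEvent` is a stand-in for the real database insert.
async function drain(
  saveEvent: (event: EventPayload) => Promise<void>,
): Promise<number> {
  let written = 0;
  while (queue.length > 0) {
    const event = queue.shift()!;
    await saveEvent(event);
    written++;
  }
  return written;
}
```

In a running server, something would need to call `drain` repeatedly (for example on a timer). Note that nothing here limits how large `queue` can grow.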
It didn't go as planned
Now I had other issues:
timeouts started
the system became unstable as the queue grew
backpressure (events entering the queue faster than the worker consumed them)
Understanding backpressure
Backpressure builds up when data comes in faster than it goes out.
producer > consumer
API faster than worker
queue growing indefinitely
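The arithmetic behind "producer > consumer" can be sketched in a few lines. The rates below are hypothetical numbers picked for illustration, not measurements from this project, and `queueDepthAfter` is an illustrative helper that assumes the queue starts empty.

```typescript
// Queue depth over time when the producer outpaces the consumer.
// With an unbounded queue, depth grows by (incoming - outgoing) every second.
function queueDepthAfter(
  seconds: number,
  incomingPerSec: number,
  outgoingPerSec: number,
): number {
  const growthPerSec = Math.max(0, incomingPerSec - outgoingPerSec);
  return growthPerSec * seconds;
}

// Hypothetical rates: the API accepts 10000 events/sec, the worker writes 3700/sec.
console.log(queueDepthAfter(10, 10000, 3700)); // 63000 events waiting after 10 seconds
```

The gap between the two rates is what matters: as long as it stays positive, the queue never catches up.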
Fixing the system
Bounded Queue
Because the queue grew without limits, at some point the system ran out of memory and crashed.
So I put a limit on the queue, and events arriving above that limit are dropped.
That comes with a trade-off: either let the system crash from a memory shortage, or drop some events to keep the server running.
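A bounded queue along these lines might look like the sketch below. `BoundedQueue`, `offer`, and `take` are illustrative names, and the capacity of 2 is only there to keep the demo short.

```typescript
// A queue with a hard capacity. When full, new items are rejected (dropped)
// instead of growing memory without limit.
class BoundedQueue<T> {
  private readonly capacity: number;
  private items: T[] = [];
  private dropped = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns true if the item was accepted, false if it was dropped.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) {
      this.dropped++;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Removes and returns up to `max` items from the front of the queue.
  take(max: number): T[] {
    return this.items.splice(0, max);
  }

  get size(): number {
    return this.items.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}

const q = new BoundedQueue<string>(2); // tiny capacity, demo only
q.offer("a");
q.offer("b");
const accepted = q.offer("c"); // queue is full, so "c" is dropped
console.log(accepted, q.size, q.droppedCount); // false 2 1
```

Counting drops is optional, but it makes it easy to see how often the limit is actually hit.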
Controlled worker throughput
Initially, I was only focused on making the API faster, but I realized that the worker also needed to be controlled.
Instead of processing all events as quickly as possible, I introduced a more controlled approach:
Events were processed in batches instead of individually (100 events at once)
Each batch was written to the database sequentially
This ensured that the worker didn't overwhelm the database with too many concurrent writes.
Interestingly, this also acted as a natural rate limiter: the worker could only process events as fast as the database allowed.
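The batching loop can be sketched roughly as follows. This assumes a plain array as the queue and a hypothetical `writeBatch` function standing in for a multi-row database insert; `processInBatches` is an illustrative name.

```typescript
type EventPayload = Record<string, unknown>;

const BATCH_SIZE = 100; // batch size mentioned above

// Drains the queue in batches and writes them one after another.
// `writeBatch` is a stand-in for a multi-row insert into the database.
async function processInBatches(
  queue: EventPayload[],
  writeBatch: (batch: EventPayload[]) => Promise<void>,
  batchSize: number = BATCH_SIZE,
): Promise<number> {
  let batches = 0;
  while (queue.length > 0) {
    // Take up to `batchSize` events from the front of the queue.
    const batch = queue.splice(0, batchSize);
    // Wait for this write to finish before starting the next one.
    await writeBatch(batch);
    batches++;
  }
  return batches;
}
```

Because each `await` finishes before the next batch is taken, only one write is in flight at a time, which is where the rate-limiting effect comes from.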
Retry Mechanism
Failures happen, and an event insertion might fail. To handle this, I created a retry queue that reprocesses failed insertions up to 3 times. After that, the event data is discarded.
This limit avoids infinite loops of retrying failed event insertions.
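One possible shape for that retry queue is sketched below. The names `RetryItem`, `processRetryQueue`, and `insertEvent` are assumptions for illustration, and `attempts` counts retries made after the initial failed insert.

```typescript
type EventPayload = Record<string, unknown>;
type RetryItem = { event: EventPayload; attempts: number };

const MAX_RETRIES = 3; // retry limit described above

// Re-attempts every item currently in the retry queue once.
// Items that fail again are re-queued until they reach MAX_RETRIES, then discarded.
// `insertEvent` is a stand-in for the real database insert.
async function processRetryQueue(
  retryQueue: RetryItem[],
  insertEvent: (event: EventPayload) => Promise<void>,
): Promise<{ succeeded: number; discarded: number }> {
  let succeeded = 0;
  let discarded = 0;
  // Snapshot the current items so re-queued ones wait for the next pass.
  const pending = retryQueue.splice(0, retryQueue.length);
  for (const item of pending) {
    try {
      await insertEvent(item.event);
      succeeded++;
    } catch {
      item.attempts++;
      if (item.attempts >= MAX_RETRIES) {
        discarded++; // give up on this event
      } else {
        retryQueue.push(item); // try again on the next pass
      }
    }
  }
  return { succeeded, discarded };
}
```

An event that keeps failing goes through three passes and is then dropped, so the retry queue cannot loop forever.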
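Final Architecture
Client -> API -> bounded in-memory queue -> batching worker -> Postgres, with a retry queue (up to 3 retries) for failed inserts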
Key Learnings
Throughput ≠ Concurrency
Async systems need backpressure
Batching improves DB performance
You don't remove bottlenecks, you move them
Tradeoffs I Had to Accept
Dropping events VS Crashing system
Latency VS Reliability
Simplicity VS Robustness
What's Next
This project helped me understand how systems behave under load, but it also showed me how much more there is to explore.
Some things I'm planning to work on next:
Moving from in-memory queues to a durable system like Redis or Kafka
Introducing a Dead Letter Queue (DLQ) for failed events
Improving scalability by running multiple workers
Exploring how to distribute load across multiple instances
The goal is to keep evolving this system and learn how real-world backend systems are designed and scaled.



