Building EventLens: Kafka, Kubernetes, Observability and SDK

🔁 Where I Left Off

By the end of Phase 3, I had built something I was genuinely proud of.

Events flowing through Kafka. Workers batch-inserting into PostgreSQL. Prometheus scraping metrics every 5 seconds. Grafana dashboards showing queue depth in real time. OpenTelemetry traces connecting a single HTTP request all the way through to the database write.

It was elegant. It was observable. It was completely unusable.

There was no SDK. No dashboard. No way for an actual developer to send events without opening a terminal and crafting a curl request. I had spent three phases building a pipeline and zero time building the thing that feeds it.

🤔 The Gap Nobody Talks About

Here's something the architecture diagrams never show you: the arrows.

Every diagram has boxes — API, Queue, Worker, Database. The boxes get all the attention. You optimize them, monitor them, scale them. But the arrows? The arrows just... exist. They're assumed.

What I realized is that one of those arrows was missing entirely. The arrow from "developer's website" to "my API." There was no SDK. No integration. Nothing.

A backend pipeline with no client is just infrastructure with ambitions. It doesn't matter how many events per second you can handle if nobody can send you a single one.

So that became the job: build the arrow.

⚡ One More Backend Thing First: Kafka

Before touching the client side, I made a call I'd been putting off — replacing Redis lists with Kafka.

Redis had been fine for learning. Messages go in, messages come out, worker processes them. Simple. But there's a dark side to Redis lists that only shows up when things go wrong: once a message is popped off the list, it's gone. Worker crashes mid-batch? Those events disappear into the void. No replay. No recovery. Just silence.

Kafka doesn't do that. It persists messages to disk and keeps them for a configurable retention window. If your consumer crashes, it resumes from its last committed offset when it restarts, so anything it had read but not yet committed gets delivered again. That's not a nice-to-have for an analytics system; it's the whole point. Losing events is silent data corruption. You don't even know it's happening.
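To make the difference concrete, here is a toy in-memory model of "read without removing, then commit". It is not Kafka's API and not the EventLens worker; the `OffsetLog` class and its method names are made up for illustration.

```typescript
// Toy model of a log with a committed offset. Hypothetical names throughout;
// a real Kafka client looks nothing like this, only the idea carries over.
class OffsetLog<T> {
  private readonly records: T[] = [];
  private committed = 0; // index of the next record that has not been acknowledged

  append(record: T): void {
    this.records.push(record);
  }

  // Reading does not remove anything; records stay in the log.
  poll(max: number): T[] {
    return this.records.slice(this.committed, this.committed + max);
  }

  // Only an explicit commit moves the read position forward.
  commit(count: number): void {
    this.committed = Math.min(this.committed + count, this.records.length);
  }
}
```

If a worker polls a batch and crashes before calling `commit`, the next `poll` returns the same records again. With a destructive pop from a list, that batch would simply be gone.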

I also added a retry topic pattern. Failed batches don't get dropped anymore — they get published to a retry topic, reprocessed, and only land in the dead letter queue after exhausting all attempts.
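A minimal sketch of that routing decision, assuming each batch carries an attempt counter. The topic names and the limit of three attempts are placeholders; the actual EventLens values aren't stated here.

```typescript
// Hypothetical topic names and retry limit, chosen only for this example.
const RETRY_TOPIC = "events.retry";
const DEAD_LETTER_TOPIC = "events.dlq";
const MAX_ATTEMPTS = 3;

interface BatchEnvelope {
  events: unknown[];
  attempts: number; // how many processing attempts have already failed
}

// Decide where a batch should be published after a processing attempt fails.
function routeFailedBatch(batch: BatchEnvelope): { topic: string; batch: BatchEnvelope } {
  const attempts = batch.attempts + 1;
  const topic = attempts >= MAX_ATTEMPTS ? DEAD_LETTER_TOPIC : RETRY_TOPIC;
  return { topic, batch: { ...batch, attempts } };
}
```

The worker would publish the returned batch to the returned topic. Keeping the counter inside the message means the retry consumer needs no extra state to know when to give up.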

The pipeline finally had a proper failure path instead of a quiet shrug.

🛠️ Building the SDK

The SDK had one job: make sending events require zero thought.

One script tag. One API key. You're done. That was the design constraint.

The first real decision was batching vs. per-event. Batching is more efficient — fewer requests, better throughput. But it comes with baggage: flush intervals, flush-on-unload logic, "what happens if the user closes the tab before the batch fires?" After thinking through the edge cases, I went with per-event for v1. Each event is its own POST request, fire and forget, no waiting for a response.

Simple. Predictable. One fewer thing to debug at 2am.
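Roughly, per-event fire and forget can look like the sketch below. The `createTracker` name is hypothetical, the payload shape follows the one shown later in this post, and the HTTP call is injected as `post` so the example runs outside a browser.

```typescript
// The subset of a fetch-style call this sketch relies on.
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<unknown>;

// Hypothetical factory: returns a track() function bound to one endpoint and key.
function createTracker(endpoint: string, apiKey: string, post: PostFn) {
  return function track(eventName: string, metadata: Record<string, unknown> = {}): void {
    const body = JSON.stringify({ api_key: apiKey, event_name: eventName, metadata });
    // Fire and forget: the promise is never awaited, and a failed request is
    // swallowed so analytics can't break the host page.
    post(endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body,
    }).catch(() => {});
  };
}
```

In a browser, `post` would be something like `(url, init) => fetch(url, init)`. Every call to `track` maps to exactly one request, which is the whole appeal.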

There was one exception though. And it nearly broke my brain.

👋 The page_leave Problem

Here's a fun game: try to reliably detect when a user leaves your page.

Sounds easy. It is not easy.

My first instinct was beforeunload + fetch. Makes sense, right? User leaves, we fire a request. Except the browser cancels in-flight fetch requests when the page unloads. The event never arrives. The server never knows the user left.

sendBeacon exists specifically for this problem. It queues the request and sends it after the page is gone, even after the tab is closed. It's reliable in a way that fetch just isn't during unload.

But — and this is the part that cascades through the entire SDK design — sendBeacon cannot send custom headers. You can't do Authorization: Bearer your_key. Headers aren't configurable. Full stop.

So the API key had to go in the request body.

Every event payload now looks like: { "api_key": "el_xxx", "event_name": "page_leave", "metadata": { "duration_ms": 4200 } }

One constraint from one edge case reshaped the authentication model for the entire SDK. That's how it goes.
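Here is a small sketch of building and sending that payload. The `sendPageLeave` helper and its parameters are assumptions for illustration; the beacon function is passed in so the code can run without a browser.

```typescript
// Minimal shape of navigator.sendBeacon: it takes a URL and a body and
// returns whether the browser accepted the request for delivery.
type BeaconFn = (url: string, data: string) => boolean;

// Hypothetical helper: builds the page_leave payload with the key in the body.
function sendPageLeave(
  beacon: BeaconFn,
  endpoint: string,
  apiKey: string,
  startedAtMs: number,
  nowMs: number
): boolean {
  const payload = {
    api_key: apiKey, // in the body, because a beacon can't carry an Authorization header
    event_name: "page_leave",
    metadata: { duration_ms: nowMs - startedAtMs },
  };
  return beacon(endpoint, JSON.stringify(payload));
}
```

In a browser the first argument would be `navigator.sendBeacon.bind(navigator)`. One thing to watch for: a string body goes out as `text/plain`, so the server has to be willing to parse JSON from that content type.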

🎯 What to Actually Capture

Auto-capture is a trap if you're not careful.

The naive version captures every click on every element. Sounds comprehensive. In practice you end up with thousands of events on wrapper divs, SVG icons, and padding areas that tell you absolutely nothing. The noise drowns out the signal.

The philosophy I landed on: capture answers, not clicks. Only capture elements where a click carries actual intent — <button>, <a>, <form> and anything explicitly tagged with data-eventlens. Everything else gets ignored.

One delegated listener on document. On every click, walk up the DOM with closest(). If nothing meaningful is found, do nothing. It sounds almost too simple, but it produces clean data instead of a firehose of garbage.
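The lookup itself is only a few lines. This sketch uses a structural `ElementLike` type instead of the real DOM `Element` so it can run anywhere; `findCaptureTarget` and the returned shape are hypothetical, not the SDK's actual internals.

```typescript
// The small part of the DOM Element interface this sketch needs.
interface ElementLike {
  tagName: string;
  closest(selector: string): ElementLike | null;
  getAttribute(name: string): string | null;
}

// Elements where a click carries intent, per the rule above.
const CAPTURE_SELECTOR = "button, a, form, [data-eventlens]";

// Walk up from the clicked node to the nearest meaningful element, if any.
function findCaptureTarget(
  target: ElementLike | null
): { tag: string; label: string | null } | null {
  const el = target === null ? null : target.closest(CAPTURE_SELECTOR);
  if (el === null) return null; // wrapper div, icon, padding: ignore the click
  return {
    tag: el.tagName.toLowerCase(),
    label: el.getAttribute("data-eventlens"),
  };
}
```

A single `document.addEventListener("click", ...)` handler would pass `event.target` into this function and send an event only when the result is not `null`. Because `closest()` checks the element itself first and then its ancestors, a click on an icon inside a button still resolves to the button.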

For SPA navigation, the SDK patches history.pushState and history.replaceState. React Router, Next.js, and most modern client-side routers go through those methods. Page views fire automatically on navigation without any framework-specific integration. No one asks you to set anything up.
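Patching those two methods can be sketched like this. `patchHistory` and `HistoryLike` are illustrative names, and the history object is passed in as a parameter so the sketch can be exercised with a fake.

```typescript
type StateFn = (data: unknown, unused: string, url?: string | null) => void;

// The two History methods the sketch wraps; window.history fits this shape.
interface HistoryLike {
  pushState: StateFn;
  replaceState: StateFn;
}

// Wrap both methods so onNavigate runs after every programmatic navigation.
function patchHistory(history: HistoryLike, onNavigate: () => void): void {
  for (const method of ["pushState", "replaceState"] as const) {
    const original = history[method].bind(history);
    history[method] = (data, unused, url) => {
      original(data, unused, url); // let the router do its work first
      onNavigate(); // by now the URL reflects the new page
    };
  }
}
```

In the browser this would be called once at startup as `patchHistory(window.history, sendPageView)`, where `sendPageView` is whatever function emits the page view event.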

🔐 The Session Question

Every event carries a session_id. The interesting design question was where to store it.

localStorage persists across tabs and browser restarts. sessionStorage resets when the tab closes.

For a session, sessionStorage is actually the right choice. A new tab means a new session. A page refresh doesn't. That maps naturally to how users actually think about "a visit." It would be weird to consider someone's tab from three days ago the same session as right now.
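A get-or-create helper is enough to get that behavior. The storage key, the `getSessionId` name, and the injected ID generator are assumptions made for this sketch.

```typescript
// The two Storage methods the sketch uses; window.sessionStorage fits this shape.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const SESSION_KEY = "eventlens_session_id"; // hypothetical key name

// Reuse the stored ID if this tab already has one, otherwise create and store it.
function getSessionId(storage: StorageLike, generateId: () => string): string {
  const existing = storage.getItem(SESSION_KEY);
  if (existing !== null) return existing;
  const fresh = generateId();
  storage.setItem(SESSION_KEY, fresh);
  return fresh;
}
```

In a browser the call might be `getSessionId(window.sessionStorage, () => crypto.randomUUID())`. A refresh finds the stored ID and keeps the session; a new tab starts with empty sessionStorage and gets a new one.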

Small decision. Meaningful downstream.

📦 Publishing to npm

I published the SDK as eventlens-js.

The first attempt returned a 403. The token had publish permissions — I checked. Turns out npm has a separate "2FA bypass" setting that's off by default even when all other permissions are enabled. You need either an Automation token or a Granular token with bypass explicitly checked.

A detail. An annoying one. Fixed in five minutes once you know what to look for.

The package ships at 4 kB gzipped. Zero runtime dependencies.

📊 The Dashboard

With events flowing in, I built the React dashboard. TanStack Query for server state, Recharts for visualizations, Clerk for auth, shadcn/ui for components.

Everything auto-refreshes every 5 seconds.

That interval was a deliberate choice. Fast enough that the dashboard feels alive. Slow enough that it isn't hammering the server. Watching the event volume chart tick upward in real time after dropping the SDK into a test page was genuinely exciting — the kind of moment where months of backend work suddenly feel concrete.

The Event Explorer shows events latest-first with cursor-based infinite scroll. You can filter by event name, user ID, and date range. Click any row to see the full metadata. It's the kind of UI that makes debugging feel like exploration rather than archaeology.
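For readers who haven't met cursor-based pagination, here is the idea over an in-memory array. The real cursor format and query aren't described in this post, so the numeric `id` cursor and the `fetchPage` function are assumptions.

```typescript
interface StoredEvent {
  id: number; // assumed to increase with insertion order
  event_name: string;
}

interface Page {
  items: StoredEvent[];
  nextCursor: number | null; // null once there is nothing older to load
}

// events must already be sorted latest-first (descending id).
function fetchPage(events: StoredEvent[], cursor: number | null, limit: number): Page {
  // "Everything older than the last row I saw" instead of "skip N rows".
  const older = cursor === null ? events : events.filter((e) => e.id < cursor);
  const items = older.slice(0, limit);
  const last = items[items.length - 1];
  const hasMore = older.length > limit;
  return { items, nextCursor: hasMore && last !== undefined ? last.id : null };
}
```

The client passes `nextCursor` back on the next scroll. Unlike offset pagination, new events arriving at the top don't shift rows between pages, which matters on a dashboard that keeps refreshing.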

🗺️ When It All Clicked

There's a moment in every project where the pieces connect and the whole thing makes sense.

For me it was this: I added the SDK to a test HTML page, opened the EventLens dashboard in another tab, clicked a button, and watched button_click appear in the Event Explorer with the right metadata, the right timestamp, and the right session ID.

Three phases of Kafka configs, Docker replicas, Prometheus metrics, OpenTelemetry spans — all of it collapsed into one row in a table.

That's what the pipeline was for.

🎯 Closing Remarks

I started this project wanting to understand one thing: how do systems behave under load?

Four blog posts later, I understand a lot more than that.

I understand why async queues exist — not because synchronous writes are slow, but because they make your API's latency dependent on your database's mood. I understand why Kafka replaced Redis for me — not because Redis is bad, but because "fire and forget" is only acceptable when you can afford to forget. I understand why observability isn't optional — because distributed systems fail silently long before they fail visibly, and metrics are the only way to catch the silence.

And I understand now that the backend is only half the story. The client — the SDK, the dashboard, the thing developers actually touch — is where the system becomes real. You can build the most elegant pipeline in the world and it means nothing if nobody can get data into it without reading your source code.

The most honest thing I can say about this project is that almost nothing worked the way I expected it to on the first try. The load test broke things I thought were solid. Redis queues introduced problems I didn't anticipate. Kafka brought its own learning curve. sendBeacon quietly invalidated an authentication design I'd already committed to.

Every phase was a lesson in the same underlying truth: you don't discover where a system breaks by thinking about it. You discover it by building it.

That's why I kept writing. Not to document what I built, but to document what it cost me to understand it.
