The cybersecurity space, especially the magical land of threat intelligence and "automated" threat intelligence, is surprisingly painful. It is manual (yes, manual), bloated, noisy, and generally soul crushing for the poor SOC team who has to wade through alerts every single day. There is a reason SOC teams exist, and this mountain of noise is a big chunk of that reason.
At Pistachio, we had one of those moments of clarity. After our CTO spent a month going through audit logs, he began seriously contemplating the nearest window as an exit strategy. Keep in mind that we barely use SharePoint internally. Since most of us are developers, our logs were nothing compared to what real companies generate.
The rise of Presence
Enter the age of AI. While others do dramatic pivots, we had nothing old to pivot from, which was honestly fantastic. So we built Presence, our fully automated, AI-driven insider threat detection system. It chews through audit logs from providers like Microsoft Purview, HubSpot, and Gong, runs them through a multi-model reasoning pipeline, and spits out insights and anomalies to our customers.
To understand how we pulled this off, you first need to know what audit logs actually are. Microsoft logs everything. Open a file? Logged. Download? Logged. Delete? Logged. Your audit logs quickly become a diary of every digital sneeze that you and everyone else in your organization makes.
While brainstorming how to turn this haystack into something useful, we ran into a funny realization. A good organization should not get many alerts, meaning the product often ends up saying "Congrats, you are normal." Great for customers, less great for showing that we are actually doing something. So we added a live counter of the total audit logs ingested and processed; a glorified odometer for admins to see that yes, we are still alive and working.
The great split: stats vs data
To achieve the aforementioned dashboard, we needed some way of propagating the data over Pub/Sub to the necessary systems, running a background job to notify about changes, and having those notifications consumed by a client-side SSE listener. Quick and easy 10-minute adventure, right? Almost.
Since we run a microservice architecture, we split things into two domains:
- event-DB: raw audit logs, exports, alert storage
- stats-DB: graphs, breakdowns, dashboards (flashy stuff, basically)
The split was not just architectural tidiness. It was a deliberate separation of concerns. The ingestion pipeline that processes logs, enriches them, and records flagged anomalies needs to stay predictable. Stats and dashboards are bursty and depend entirely on user behavior, and we did not want those traffic spikes affecting ingestion.
The separation comes down to a strict separation-of-concerns rule:
- event-DB owns truth. Raw logs, enrichments, alerts.
- stats-DB owns interpretation. Aggregations, counters, time windows, and everything powering the UI.
The philosophy was simple: stats must never be able to hurt ingestion, and ingestion must never care about stats.
With that clear line, we could throttle, cache, or rewrite the stats pipeline freely without ever risking delayed log processing.
And with our data flow being asynchronous over Pub/Sub in GCP, wiring things up was easy. One message in, two services get fed. A few hours later we had logs going into both systems. Twice the inserts, twice the fun.
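Concretely, the fan-out is just one topic with one subscription per consumer. Here is a minimal sketch of that wiring; the project, topic, and subscription names are made up for illustration, not our actual config:

```go
// Minimal sketch of the fan-out: one message in, two services fed.
// Project, topic, and subscription names are illustrative.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "example-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The ingestion side publishes each audit log once.
	topic := client.Topic("audit-logs")
	res := topic.Publish(ctx, &pubsub.Message{
		Data: []byte(`{"org":"acme","action":"FileAccessed"}`),
	})
	if _, err := res.Get(ctx); err != nil {
		log.Fatal(err)
	}

	// Each consumer has its own subscription on the same topic, so the event
	// service and the stats service both get a copy. This is the stats side;
	// the event side looks identical with its own subscription name.
	sub := client.Subscription("audit-logs-stats")
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// insert into stats-DB here
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```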
Now, why PostgreSQL? Honestly, I know PostgreSQL, and I knew I wanted a relational DB, so the choice was simple. Our CTO was convinced Postgres was a terrible idea, but he did not really have a better option either. For his part of the pipeline, reading and writing data before sending it off to our AI models, he went with Bigtable. For this use case, though, Bigtable would have broken under intense read hotspotting. So Postgres won by default.
Once things were flowing, I remembered Postgres had partitioning. Perfect timing, because stuffing millions of logs into a single table is a great way to ruin your week. Partitioning looked like the escape hatch we needed.
We first thought about hash partitioning by org, but that died fast. Some orgs make 300 logs, some make 30 million. That is how you end up with one tiny partition and one absolute monster. Congrats, you reinvented unbalanced shards. Not ideal.
So we went with daily partitions, which are a lot nicer. Except PostgreSQL does not create future partitions automatically. For a few days, we just crashed. Once we added a job to create partitions ahead of time (and delete anything older than 90 days), the sea of red turned green and life was good again. Well, mostly. Deleting old partitions came with a new problem: what if a customer wants to look back at an alert more than 90 days old? Enter migrations to persistent tables or CSV snapshots at alert time. That solved the "oh no we deleted the evidence" problem.[1]
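The partition job itself is tiny. Here is a rough sketch of the idea, assuming a parent table called audit_logs that is range-partitioned on an occurred_at timestamp (both names are illustrative, not our exact schema):

```go
// Sketch of the partition-maintenance job: create tomorrow's partition ahead
// of time, and drop the partition that just fell outside the 90-day window.
// The audit_logs table and occurred_at partition key are illustrative names.
package partitions

import (
	"database/sql"
	"fmt"
	"time"
)

func maintainPartitions(db *sql.DB) error {
	today := time.Now().UTC().Truncate(24 * time.Hour)

	// Create tomorrow's partition before midnight so inserts never land on a
	// missing partition (the crash we kept hitting before this job existed).
	tomorrow := today.AddDate(0, 0, 1)
	create := fmt.Sprintf(
		`CREATE TABLE IF NOT EXISTS audit_logs_%s PARTITION OF audit_logs
		   FOR VALUES FROM ('%s') TO ('%s')`,
		tomorrow.Format("20060102"),
		tomorrow.Format("2006-01-02"),
		tomorrow.AddDate(0, 0, 1).Format("2006-01-02"),
	)
	if _, err := db.Exec(create); err != nil {
		return err
	}

	// Dropping an expired partition is a metadata operation: no row-by-row
	// DELETE, no vacuum backlog, the whole day just disappears.
	expired := today.AddDate(0, 0, -90)
	drop := fmt.Sprintf("DROP TABLE IF EXISTS audit_logs_%s",
		expired.Format("20060102"))
	_, err := db.Exec(drop)
	return err
}
```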
PostgreSQL partitioning gives you lifecycle management for free. When data is partitioned by day, you do not delete millions of rows. You drop a table. That is a constant time metadata operation. Instead of Postgres scanning and deleting row by row, it simply removes the entire partition instantly. No vacuum backlog. No bloat. No pain.
Another benefit is index containment. In a single huge table, indexes grow forever and every write must update that global index. With partitions, each child table has its own set of local indexes. This keeps indexes small, write amplification low, and index bloat isolated to a specific day rather than the entire dataset.
Partition pruning means Postgres ignores irrelevant partitions entirely, which keeps scans tight and cache friendly even at high scale.
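To make that concrete, here is what a pruned query looks like against the same hypothetical audit_logs table. Because the WHERE clause constrains the partition key, EXPLAIN shows a scan of a single daily partition instead of all ninety:

```go
// Illustration of partition pruning against the hypothetical audit_logs table:
// the date filter on the partition key lets Postgres skip every other day.
package partitions

import (
	"database/sql"
	"time"
)

func logsForDay(db *sql.DB, orgID string, day time.Time) (int64, error) {
	var n int64
	err := db.QueryRow(
		`SELECT count(*)
		   FROM audit_logs
		  WHERE org_id = $1
		    AND occurred_at >= $2
		    AND occurred_at <  $3`,
		orgID, day, day.AddDate(0, 0, 1),
	).Scan(&n)
	return n, err
}
```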
Reality hits like a bag of bricks
Things worked beautifully, right up until more customers actually started using the system.
Our first big issue: live aggregation everywhere. COUNT queries, login counts, 24 hour breakdowns, you name it. Even with indexes, these can force sequential scans. With a few orgs it was fine, but with many orgs, the DB CPUs overheated both emotionally and spiritually.
So we changed our strategy from "perfect real-time numbers" to "99.9 percent correct and we all get to keep our sanity". We denormalized on insert. Instead of counting later, we increment counters as logs arrive. Insert, increment, done. First problem solved.
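In practice, "denormalize on insert" is just an upsert next to the insert. A sketch of the shape, assuming a hypothetical org_daily_stats counter table with a unique constraint on (org_id, day); the column names are illustrative:

```go
// Sketch of "denormalize on insert": when a log arrives in the stats service,
// bump that org's counters for the day instead of counting rows later.
// org_daily_stats and its columns are illustrative; a unique constraint on
// (org_id, day) is assumed so the upsert has something to conflict on.
package stats

import (
	"database/sql"
	"time"
)

func bumpCounters(db *sql.DB, orgID string, occurredAt time.Time, isLogin bool) error {
	login := 0
	if isLogin {
		login = 1
	}
	_, err := db.Exec(
		`INSERT INTO org_daily_stats (org_id, day, log_count, login_count)
		 VALUES ($1, $2::date, 1, $3)
		 ON CONFLICT (org_id, day) DO UPDATE
		    SET log_count   = org_daily_stats.log_count + 1,
		        login_count = org_daily_stats.login_count + EXCLUDED.login_count`,
		orgID, occurredAt, login,
	)
	return err
}
```

Dashboards then read a handful of pre-counted rows instead of scanning millions of logs.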
Then we unintentionally started DDoSing ourselves, because we had customers. Every admin wants to see fresh data, right? And they didn’t believe the numbers were updated in real time. Refresh, refresh, refresh. And every refresh triggered heavy queries. In reality, we had just placed too much faith in the lowest-tier PostgreSQL instance in GCP. We were not actually getting DDoSed; we were just being stupid.
We implemented a stale-while-revalidate-ish cache. The first user triggers the expensive query, we store the full result as bytes, and anyone loading the page afterward gets that cached snapshot. The whole thing is then refreshed automatically by background jobs, so a snapshot is never more than 30 minutes stale. Suddenly the DB could breathe again.
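A rough sketch of that cache layer, using an in-memory map per org for brevity (the real thing stores snapshots as bytes and is refreshed by scheduled jobs); every name here is made up:

```go
// Sketch of the stale-while-revalidate-ish cache: readers get the stored
// snapshot if it is fresh enough, and only a miss (or a background job)
// runs the expensive query. Names and the in-memory store are illustrative.
package statscache

import (
	"sync"
	"time"
)

const maxStale = 30 * time.Minute

type snapshot struct {
	data      []byte // full query result, serialized once
	refreshed time.Time
}

type Cache struct {
	mu    sync.Mutex
	byOrg map[string]snapshot
}

func NewCache() *Cache {
	return &Cache{byOrg: make(map[string]snapshot)}
}

// Get serves the cached snapshot for an org, running the expensive query only
// when nothing fresh enough exists. Background jobs call Refresh on a schedule
// so most page loads never hit the slow path.
func (c *Cache) Get(orgID string, query func() ([]byte, error)) ([]byte, error) {
	c.mu.Lock()
	snap, ok := c.byOrg[orgID]
	c.mu.Unlock()

	if ok && time.Since(snap.refreshed) < maxStale {
		return snap.data, nil
	}
	return c.Refresh(orgID, query)
}

// Refresh runs the expensive query and stores the full result as bytes.
func (c *Cache) Refresh(orgID string, query func() ([]byte, error)) ([]byte, error) {
	data, err := query()
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.byOrg[orgID] = snapshot{data: data, refreshed: time.Now()}
	c.mu.Unlock()
	return data, nil
}
```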
Until it couldn't, again.
Because naturally, we found a new way to DDoS ourselves. Our background jobs were still marching through every organization on a fixed schedule, pushing out stats update notifications whether anyone was actually looking at the dashboard or not. So even when nobody cared, we were happily hammering the DB anyway.
To fix this, we changed how live viewing works. If nobody is on the stats page, the listener gets marked as "dead" after five minutes. Once dead, that org stops receiving scheduled updates entirely. No viewer, no notifications, no DB load. The moment someone opens the page again, we flip the switch back on and drop that org back into the regular background job rotation. Simple, and suddenly our database was not being bullied by ghost traffic. It does not completely eliminate the risk of us DDoSing ourselves, but it pushes us in the right direction.
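The liveness tracking boils down to a timestamp per org. A sketch under that assumption, with the five-minute cutoff from above and made-up names:

```go
// Sketch of the "dead listener" switch: each SSE heartbeat marks an org as
// recently viewed, and the scheduled job skips any org nobody has looked at
// for five minutes. Names are illustrative.
package liveness

import (
	"sync"
	"time"
)

const viewerTimeout = 5 * time.Minute

type Viewers struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func NewViewers() *Viewers {
	return &Viewers{lastSeen: make(map[string]time.Time)}
}

// MarkSeen is called whenever an org's SSE listener is active (page open,
// heartbeat received). It flips the org back into the update rotation.
func (v *Viewers) MarkSeen(orgID string) {
	v.mu.Lock()
	v.lastSeen[orgID] = time.Now()
	v.mu.Unlock()
}

// Alive reports whether anyone has viewed the org's stats page recently.
// The background job checks this before doing any work for the org.
func (v *Viewers) Alive(orgID string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	seen, ok := v.lastSeen[orgID]
	return ok && time.Since(seen) < viewerTimeout
}
```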
Even more bricks
As larger orgs arrived, some generating more than ten million logs every week, something else broke. Our live per-org aggregation query became too heavy, even with the cache.
Twenty-four hour breakdowns were manageable. Fourteen-day breakdowns were not.
To help solve that, we introduced daily denormalization into a semi-persistent table (a rough sketch follows the list):
- Store each day’s aggregate numbers
- Move over the last 3 days[2] of denormalized data every day to tolerate Purview’s delayed logs
- Delete everything older than 14 days
- Combine 13 days from denormalized tables with today’s live aggregated numbers
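Roughly, the rollup job is two SQL statements plus a combined read. A sketch, with illustrative table names (stats_events for the stats-DB's own copy of the logs, org_daily_rollup for the semi-persistent daily table):

```go
// Sketch of the daily rollup job. Table and column names are illustrative;
// the shape of the steps is the point.
package rollup

import "database/sql"

func rollUpDaily(db *sql.DB) error {
	// 1. Re-aggregate the last 3 days and upsert them, so logs that Purview
	//    delivers late still land in the right day's row.
	if _, err := db.Exec(`
		INSERT INTO org_daily_rollup (org_id, day, log_count, login_count)
		SELECT org_id, occurred_at::date, count(*),
		       count(*) FILTER (WHERE action = 'UserLoggedIn')
		  FROM stats_events
		 WHERE occurred_at >= now()::date - interval '3 days'
		   AND occurred_at <  now()::date
		 GROUP BY org_id, occurred_at::date
		ON CONFLICT (org_id, day) DO UPDATE
		   SET log_count   = EXCLUDED.log_count,
		       login_count = EXCLUDED.login_count`); err != nil {
		return err
	}

	// 2. Keep only the 14-day window the dashboards actually read.
	_, err := db.Exec(
		`DELETE FROM org_daily_rollup WHERE day < now()::date - interval '14 days'`)
	return err
}

// The read side then combines 13 rolled-up rows per org with a live aggregate
// over today's logs only, so the expensive part of the query stays tiny.
```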
Then we hit a funny corner case. Our designers and our VP of Data wanted to show total distinct counts across the entire window instead of just per day (which was completely reasonable), and that turned the problem from manageable into expensive. One org in particular accessed an absurd number of unique files per day. Counting distinct files over millions of logs is expensive, and a hard problem to solve with denormalized counts alone. Some databases have APPROX_COUNT_DISTINCT, but PostgreSQL does not. To fix this, we opted for three small tables that store uniqueness at write time: unique files per org, files accessed by AI, and SharePoint areas accessed. All of them are cleaned up automatically after 14 days. Counting them is now a simple indexed COUNT instead of a COUNT(DISTINCT ...) over raw logs.
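Here is a sketch of one of those three tables, the unique-files one. Everything is illustrative: a hypothetical org_unique_files table with a unique constraint on (org_id, file_id) does the deduplication, and the other two tables look the same with different keys:

```go
// Sketch of uniqueness-at-write: every file-access log inserts its (org, file)
// pair and lets a unique constraint deduplicate, so the dashboard number is a
// plain indexed COUNT instead of COUNT(DISTINCT) over millions of raw logs.
package uniques

import (
	"database/sql"
	"time"
)

// trackUniqueFile runs on the write path for every file-access log.
func trackUniqueFile(db *sql.DB, orgID, fileID string, seenAt time.Time) error {
	_, err := db.Exec(
		`INSERT INTO org_unique_files (org_id, file_id, last_seen)
		 VALUES ($1, $2, $3)
		 ON CONFLICT (org_id, file_id) DO UPDATE SET last_seen = EXCLUDED.last_seen`,
		orgID, fileID, seenAt)
	return err
}

// uniqueFileCount backs the dashboard number. Rows whose last_seen is older
// than 14 days are removed by the same cleanup job that trims the rollup.
func uniqueFileCount(db *sql.DB, orgID string) (int64, error) {
	var n int64
	err := db.QueryRow(
		`SELECT count(*) FROM org_unique_files WHERE org_id = $1`,
		orgID).Scan(&n)
	return n, err
}
```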
A bag of much lighter, manageable, extendable bricks
Despite all the denormalization, we never stopped storing the raw logs in the 90-day partitioned table. That means we can still introduce new metrics, aggregate over the last 90 days, and backfill 14 days of denormalized data before rolling out new graphs.
For a system that started with "what if we just slammed logs into two databases", I would say it turned out surprisingly elegant.


