Having a good data setup is key to understanding what’s going on with your product. It helps you understand how users interact with the product, which marketing campaigns are working, and which obstacles on your website are hurting your conversions. These are all things you expect to happen. You are prepared: you’re monitoring results, running A/B tests and whatnot. But sometimes things show up in your data that you weren’t expecting. On a good day it’s a spike in your active users that you can quickly trace back to some social media post. On a bad day your numbers are spiraling, and you have exactly zero minutes to figure out why.
Did We Merge Something?
If you work in a highly competitive, fast-paced environment, you are probably used to continuous releases. There is no time for two weeks' notice; as soon as you’re done, it rolls out. For those of you blessed (or cursed - your choice) with scheduled releases, this post is less useful. The rest of us know that the answer to ‘did we merge something’ is always ‘plenty’.
The process of figuring out exactly what caused a problem is cumbersome, and it usually involves searching through the corners of GitHub for that one special PR that ruined your day. It’s very time-consuming, and you wish you could just query the thing. The good news is that you actually can.
We often collect data about every corner of our organization, and honestly a lot of it is not as useful as we pretend. There is, however, a clear case for collecting data about your deployments, especially when your technology is a key component of your business. What affects your tech affects everyone.
Collecting Deployment Events
When we started discussing this at Pistachio, it wasn’t really clear how we could do it, only that it obviously had to be possible. Essentially what we were looking for was a way to send an event each time we deployed something, with some information on that deployment. It only took a few searches on the internet to find a solution that would pretty much give us what we wanted.
Because we are using Google Pub/Sub and Cloud Build, the initial setup took about 4 minutes. Cloud Build automatically sends deploy notifications to Pub/Sub; all you need to do is add the cloud-builds topic. By default, the message contains some of the data we wanted, like the name of the repo, but we also wanted a few additional fields: the app we were deploying to, the owner of the deploy, and the first commit message. With the setup we are running, these fields provide enough information to determine whether what was deployed could have affected our product data. The additional fields had to be added to our Cloud Build configurations, which wasn’t too much work but was not as nice as the initial 4-minute setup.
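To make the shape of such an event concrete, here is a minimal sketch of turning one cloud-builds message into a flat row. It assumes the message arrives in Pub/Sub's push-style form, where the data field is base64-encoded JSON describing the build, and it assumes the extra fields were passed as user-defined substitutions. The names _APP, _OWNER and _COMMIT_MESSAGE are hypothetical stand-ins, not the names we actually use, and parse_build_notification is just an illustrative helper.

```python
import base64
import json


def parse_build_notification(message: dict) -> dict:
    """Flatten one message from the cloud-builds topic into a table row."""
    # Assumption: push-style envelope, so "data" is a base64 string that
    # holds the JSON description of the build.
    build = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    subs = build.get("substitutions", {})
    return {
        "build_id": build.get("id"),
        "status": build.get("status"),
        "finished_at": build.get("finishTime"),
        "repo": subs.get("REPO_NAME"),
        # Hypothetical user-defined substitutions added in the build config.
        "app": subs.get("_APP"),
        "owner": subs.get("_OWNER"),
        "commit_message": subs.get("_COMMIT_MESSAGE"),
    }


# Made-up example payload, only here to show the round trip.
example_build = {
    "id": "build-123",
    "status": "SUCCESS",
    "finishTime": "2021-11-23T14:05:00Z",
    "substitutions": {
        "REPO_NAME": "web-frontend",
        "_APP": "web",
        "_OWNER": "alex",
        "_COMMIT_MESSAGE": "Change domain handling",
    },
}
example_message = {
    "data": base64.b64encode(json.dumps(example_build).encode("utf-8")).decode("ascii")
}
example_row = parse_build_notification(example_message)
```

Fields that are missing from a message simply come back as None, which keeps the row shape stable if some builds don't set the extra substitutions.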
A More Efficient Debugging Process
With the message queue up and running, we started streaming the deployment data into a table in Google BigQuery. This gives us a full overview of our deployments, and it is easy to join this data with our user interaction data. Having all this in one view makes it easier to pin down when a change in the data started occurring, and what was deployed around that time.
In one example from our own data, the pattern for confirmed interactions changes around November 24th. I filtered the view to only show deployments from apps that are relevant to this data. It tells me which of those apps had deployments in the days leading up to the change.
I can change my view to ‘zoom in’ on the more relevant time period, and look at the commit messages from the deployments in the days before and on November 24th. I can quickly spot that on November 23rd we released a feature that affected our domains, which is probably related. Some issues are obviously easier to backtrack like this than others, but the ability to quickly narrow down our search saves us time in our debugging processes, which is good for us and good for our customers.
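The ‘zoom in’ step boils down to a simple filter: given the date the metric changed, list deployments to the relevant apps in the days just before it. We do this in a BigQuery view; the snippet below is only a plain-Python stand-in for that logic. The function name, the row fields (app, finished_at, commit_message), the year and the sample rows are all invented for illustration.

```python
from datetime import date, datetime, timedelta


def deployments_before(deployments, change_date, apps, days_back=3):
    """Deployments to `apps` that finished within `days_back` days
    before `change_date`, or on `change_date` itself, oldest first."""
    start = change_date - timedelta(days=days_back)
    hits = [
        d for d in deployments
        if d["app"] in apps and start <= d["finished_at"].date() <= change_date
    ]
    return sorted(hits, key=lambda d: d["finished_at"])


# Invented sample rows; the year is arbitrary.
sample = [
    {"app": "web", "finished_at": datetime(2021, 11, 23, 14, 5),
     "commit_message": "Change domain handling"},
    {"app": "web", "finished_at": datetime(2021, 11, 10, 9, 30),
     "commit_message": "Fix typo"},
    {"app": "billing", "finished_at": datetime(2021, 11, 23, 16, 0),
     "commit_message": "Update invoices"},
    {"app": "api", "finished_at": datetime(2021, 11, 22, 9, 0),
     "commit_message": "Bump dependencies"},
]

suspects = deployments_before(sample, date(2021, 11, 24), {"web", "api"}, days_back=2)
for d in suspects:
    print(d["finished_at"].date(), d["app"], d["commit_message"])
```

Keeping the app filter and the time window as parameters mirrors how the view is used in practice: start wide, then narrow down until only a handful of commit messages are left to read.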
Cloud Build’s automatic deploy notifications made this easy for us, but the same setup can be achieved with other CI/CD platforms like CircleCI or Azure DevOps through webhooks or service hooks. I highly recommend this for tech companies or other companies with tech as an integral part of their business. It allows you to see both sides of your system and watch in real time as the dance unfolds. If you are running a message-based architecture, you are basically halfway there.