Everyone knows about the CrowdStrike incident: A developer made a change and everything went down. It was bad.
I have a lot of sympathy for the developer who wrote that code. My mistakes don’t affect millions of computer systems globally, and I couldn’t do a fraction of the damage even if I wanted to. Working on big, important systems means you can make big, important mistakes.[1]
Mistakes don’t occur in a vacuum, either. People write bad code all the time, and our systems need to be robust to that. If an out-of-bounds memory read can result in a catastrophe so bad that the US president needs to be briefed on the situation, something is badly broken on a much grander scale.
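To make that concrete: at the code level, the class of defect being described is mundane. Here is a minimal sketch (not CrowdStrike’s actual code; the record and field names are invented for illustration) of the difference between trusting a content file’s input and bounds-checking it before indexing:

```c
/* Hypothetical example: reading a field from a parsed content record.
 * The input arrives from an external update, so it is untrusted and
 * must be validated before it is indexed. */
#include <stddef.h>
#include <stdio.h>

#define MAX_FIELDS 21

typedef struct {
    size_t field_count;             /* how many entries are actually present */
    const char *fields[MAX_FIELDS];
} content_record;

/* Unsafe: trusts that `index` is in range, so a malformed update that
 * supplies fewer fields than expected causes an out-of-bounds read. */
const char *get_field_unchecked(const content_record *rec, size_t index) {
    return rec->fields[index];
}

/* Defensive: validates the index against what the input actually
 * contains and fails closed instead of reading past the array. */
const char *get_field(const content_record *rec, size_t index) {
    if (rec == NULL || index >= rec->field_count)
        return NULL;
    return rec->fields[index];
}

int main(void) {
    content_record rec = { .field_count = 1, .fields = { "only-field" } };
    const char *f = get_field(&rec, 20);   /* out of range: returns NULL */
    printf("%s\n", f ? f : "(missing field, handled gracefully)");
    return 0;
}
```

The unchecked version is the kind of code people write all the time; a robust system assumes it will exist somewhere and contains the blast radius anyway.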
But when I try to put my finger on the broken thing that allowed this to happen, I find that I can’t. It seems that CrowdStrike was in line with industry standards, and their customers were perfectly reasonable for trusting them as a vendor. If we’re going to avoid these incidents in the future, it might be time to realize that our industry standards can’t keep us safe.
Standards and Certification
CrowdStrike isn’t some cowboy organization that YOLOs code into prod and walks away. They have a long list of certifications, including ISO27001 and SOC2, attesting to their seriousness as a vendor. CrowdStrike has proven to external auditors that they conform to standards containing controls related to testing, change management, and risk mitigation.
On the customer side, it is perfectly normal to rely on those types of certifications as evidence that a vendor has their operations in order. Of course, customers still send out lengthy vendor questionnaires of their own, but that’s less about assessing the suitability of a vendor and more about binding the vendor to certain legal representations. When a vendor asks if all data is encrypted in transit and at rest, it’s like US Immigration asking whether you are part of a terrorist organization. They know the answer, but they want you to say it so they can later use it against you if they need to. In terms of assessing whether CrowdStrike is a trustworthy vendor, it’s the certifications that carry the weight.
In other words, things worked the way they were supposed to. CrowdStrike had crossed their Ts and dotted their Is, and in doing so earned the trust of some major, systemically important companies.
A Skill Issue
Running a complex software product with massive distribution and aggressive permissions is hard, and it requires a ton of skill to do it right. There is no generic playbook that you can follow. CrowdStrike clearly messed up by not having the right processes in place to prevent this, but there’s no standard that defines what those processes should have been. I’m certainly not going to attempt to tell CrowdStrike what they “should have done”.
My advice to customers isn’t especially helpful either. The next major incident is unlikely to look like the last, so setting up measures that would have prevented this specific problem (insofar as that is even possible) probably won’t get you very far.[2] And more to the point, asking your most critical vendors to have certifications like ISO27001 and SOC2 is still a good idea. Those standards might not be enough to prevent disasters like this one, but that doesn’t mean they’re pointless.
But there is one takeaway that I think is useful, albeit a bit abstract: You can’t compliance your way to good processes, security, and reliability. Good engineering isn’t a checkbox. Standards and certifications won’t stop you from bringing your company to its knees, so that job falls on you.