Matt is the Co-Founder and CTO of Gremlin, the world's first enterprise Chaos Engineering platform (raised $20M+ to date). Prior to joining Gremlin, Matt was a Senior Platform Engineer at Salesforce, and a Lead Engineer at Amazon. Matt joined FirstMark’s Guilds to explore some of the most common and important chaos engineering frameworks that help growing teams build available, resilient, and scalable products.
When Matt started working at Amazon, he was on a new team that tracked and reported production issues directly to senior management, including Jeff Bezos himself. Because they’d just finished a massive migration from a monolith to microservices, many things were breaking, and Matt was spending long hours trying to fix all the bugs. Frustrated, he went to his boss, who said something that would change Matt’s career trajectory: “Sounds like a lot is happening to you, and you need to be happening to it.” Although this feedback wasn’t what he wanted to hear at first, it caused him to reflect. Deep down, he knew the only way to get ahead of problems was to be proactive.
This guidance fundamentally reshaped the way that Matt and his team approached chaos engineering. In this piece, we’ll review a few of Matt’s best pointers to go from a reactive organization to a proactive one.
Laying the Foundation: Understand your System
Understanding your system is the foundation for an effective chaos engineering strategy. If you don’t have a sense of what is “normal” for your system, it will be very hard to spot incidents before they happen and to manage them appropriately when they do occur. Matt points to the four golden signals of monitoring, developed by Google’s SRE team, to help you gauge the health of your system at any given time:
- Latency: the time it takes to service a request
- Traffic: a measure of how much demand is being placed on your system, measured in a high-level system-specific metric
- Errors: the rate of requests that fail, either explicitly, implicitly, or by policy
- Saturation: a measure of how “full” your system is, emphasizing the resources that are most constrained
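As a rough illustration of how the four signals can be derived from raw data, here is a minimal Python sketch. The `Request` record, the `golden_signals` function, and the capacity parameters are hypothetical names chosen for this example; a real system would usually pull these values from its monitoring stack instead of computing them by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request failed


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct is 0-100)."""
    if not sorted_values:
        return 0.0
    rank = -(-pct * len(sorted_values) // 100)  # ceiling division
    return sorted_values[max(int(rank), 1) - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize one monitoring window into the four golden signals."""
    durations = sorted(r.duration_ms for r in requests)
    failures = sum(1 for r in requests if not r.ok)
    return {
        # Latency: how long requests take (median and tail)
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        # Traffic: demand on the system, here in requests per second
        "traffic_rps": len(requests) / window_seconds,
        # Errors: fraction of requests that failed
        "error_rate": failures / len(requests) if requests else 0.0,
        # Saturation: how full the most constrained resource is
        "saturation": used_capacity / total_capacity,
    }
```

Calling `golden_signals(requests, window_seconds=10, used_capacity=6, total_capacity=8)` on a recent batch of requests gives a small dictionary that can be compared against what is “normal” for the system. Which resource counts as the most constrained one (memory, connections, queue depth) is an assumption each team has to make for its own architecture.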
Knowing How to Attack: Create A Response Process
After you understand where to look for issues, you have to know how to manage issues when they arise. Responding to incidents doesn’t have to be overly complex, but ensuring a process exists is essential to creating healthy reliability habits. As an earlier-stage company with finite resources, Gremlin’s first step was to implement a simple and clear incident response system. They have a rotating set of on-call engineers. When an incident occurs, the on-call leader is responsible for getting the right team members engaged. Incident response kicks off with a triage Zoom call, plan development, and a dedicated Slack channel for asynchronous communication through resolution. (Note: for those looking for workflow solutions for incidents, FirstMark is a proud backer of Kintaba, which “manages your incident response process so you don't have to.”)
Building Resiliency through Blameless Postmortems
When an incident is resolved, it is best practice to conduct a postmortem. These meetings bring teams together to retroactively figure out how an incident happened, how the team responded, and what can be done to prevent similar incidents from happening in the future.
Gremlin runs blameless postmortems, which focus on how the team can grow from an incident rather than on who is to blame for it. While postmortems aren’t unique to Gremlin, the company does put its own spin on them by inviting the entire company to each postmortem. Matt believes that having a wider audience helps the team understand the full impact of incidents on the company and its customers.
One question that often arises is whether to address some, many, or all of the root causes that emerge in a postmortem. During a postmortem, Matt suggests earmarking a few do not repeat (DNR) action items. DNR action items must be fixed immediately to ensure that certain incidents won’t happen again, and teams can’t start any feature work until DNR issues are fixed. Other non-DNR root causes are addressed as part of the standard product prioritization process.
Catching ‘Em All: Run Game Days
All of the above tactics represent the essential groundwork for teams to build a high-functioning chaos engineering practice. And once this groundwork is in place, you are well positioned to shift from a reactive strategy to a proactive one.
Where a lot of startups get stuck is in understanding what the first step of a proactive strategy looks like. Matt’s advice: “You can start slow, but don't be slow to start.” Many teams perceive implementing a chaos engineering process as a large, amorphous project, and as a result they move too slowly to get it off the ground. Instead of treating it like a large project, treat it like a small, simple one and just get started.
Enter Game Days. Game Days are a practice that Matt carried over from Amazon. They are half-, full-, or multi-day internal events in which team members identify potential ways your system could fail. With a set of possible failure cases in place, the team then runs experiments to stress test the system. Over time, you create a library of failure conditions, which enables you to either address the root cause or, at a minimum, build in effective guard rails.
As a company scales, you’ll eventually have an entire team dedicated to this practice. At Amazon, for example, Game Days eventually morphed into what is now called the Platform Excellence team.
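One way to picture a library of failure conditions is as a list of small, repeatable experiments. The sketch below is only an illustration of that idea, not Gremlin’s or Amazon’s tooling: the `Experiment` class, its fields, and `run_game_day` are hypothetical names, and the health check is assumed to be something the team already has.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    name: str
    inject: Callable[[], None]        # introduce the failure condition
    rollback: Callable[[], None]      # undo the failure condition
    steady_state: Callable[[], bool]  # True while the system looks healthy


def run_game_day(experiments: List[Experiment]) -> Dict[str, str]:
    """Run each experiment in turn and record whether the system coped."""
    results = {}
    for exp in experiments:
        # Never inject a failure into a system that is already unhealthy.
        if not exp.steady_state():
            results[exp.name] = "skipped"
            continue
        exp.inject()
        try:
            results[exp.name] = "survived" if exp.steady_state() else "failed"
        finally:
            # Always clean up, even if the health check itself raised.
            exp.rollback()
    return results
```

Experiments that come back as "failed" are candidates for a root-cause fix or a guard rail, and the list itself can be rerun at the next Game Day to confirm the fix holds.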
Proactive > Reactive: Identify Ways Your System Can Fail
Once you’ve identified where to search for issues and developed a concrete response process around incidents, your process will mature to the point where you can proactively and systematically seek out issues. Matt identified three broad classes of potential failures to monitor:
- Restricted Resources: Most people assume that their architecture will scale seamlessly when resources become scarce. But have you actually tested that assumption? Or do you wait until a flood of traffic exhausts your resources to find out?
  - Some specific things to test: resource scarcity, underutilization, containers exceeding limits, clusters out of resources
- Stateless Services: How does your system handle process termination? What about host or node termination? Do you have an orchestrator? How does it kill processes?
  - Some specific things to test: process termination, host or node termination, daylight saving time, certificate expiration
- Naughty Networks: What happens if there’s a network partition? Does your business continuity and disaster recovery plan work? What happens if there’s increased latency?
  - Some specific things to test: network partition, increased latency, some percent of packets are dropped, DNS unavailable
Proactively answering these questions can ensure that your team is prepared to address any of these issues should they arise, rather than having to tackle them in real-time as they occur.
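To make the “Naughty Networks” class more concrete, here is a minimal Python sketch that injects latency and dropped requests at the application level by wrapping a function call. `FlakyNetwork` and `call_with_retries` are hypothetical helpers written for this example. Real experiments typically inject faults at the network layer, so treat this as a stand-in that is easy to run locally.

```python
import random
import time


class FlakyNetwork:
    """Wraps a callable and injects added latency and dropped calls."""

    def __init__(self, call, latency_s=0.0, drop_rate=0.0, seed=None,
                 sleep=time.sleep):
        self._call = call
        self.latency_s = latency_s      # extra delay added to every call
        self.drop_rate = drop_rate      # fraction of calls to fail (0.0-1.0)
        self._rng = random.Random(seed)
        self._sleep = sleep             # injectable so tests need not wait

    def __call__(self, *args, **kwargs):
        if self.latency_s > 0:
            self._sleep(self.latency_s)
        if self._rng.random() < self.drop_rate:
            raise ConnectionError("injected fault: request dropped")
        return self._call(*args, **kwargs)


def call_with_retries(fn, attempts=3):
    """A simple guard rail: retry a call that fails with ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError as err:
            last_error = err
    raise last_error
```

Wrapping a client call as `FlakyNetwork(fetch_profile, latency_s=0.5, drop_rate=0.1)` (where `fetch_profile` is whatever function talks to the dependency) lets a team see whether timeouts, retries, and fallbacks behave as expected before a real partition forces the question.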
Q&A: Chaos Engineering in Practice
What is the best environment to practice chaos engineering?
For those just getting started, Matt (perhaps unsurprisingly) recommends running tests in staging. Over time, though, teams should gradually build comfort - and the right process - for testing in production. In addition, while they are getting their feet under them, teams can focus their initial efforts on smaller, less destructive experiments, such as consuming CPU or memory.
Any non-obvious team members to include in chaos engineering?
To start, the most obvious team members to include in the early days of a chaos engineering process are those who designed the systems you are testing. Given their familiarity, they’ll have a great intuition about all of the ways that system can fail.
Two other teams who have made huge contributions to chaos engineering efforts are developer advocates (if applicable) and customer-facing employees. Developer advocates bring a mix of creativity, understanding of how end users leverage your product, and strong technical know-how. Similarly, customer-facing employees - while generally non-technical - talk to customers and hear firsthand how customers use your product, problems and issues that have arisen, and more.
Testing our production database seems important. But replicating production to staging isn’t always practical, and testing in production is risky. Any ideas?
Matt and the CTO Guild shared several potential strategies here. Over the longer term, you will get comfortable testing in a production environment. Before then, you can artificially create the conditions you’re trying to simulate:
- Sleep statements in queries on test/staging to simulate thread exhaustion
- Constrain instance types in test/staging to limit the amount of memory or other resources available
- Execute resource-wasting “update all” SQL statements to tie up resources
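The first strategy above can be tried out locally. The following Python sketch uses SQLite from the standard library and registers a user-defined `sleep` SQL function to make queries slow, then shows a fast query stuck behind slow ones in a small worker pool. SQLite has no built-in sleep, so the function is registered by hand here; other databases have their own equivalents. The pool size, query counts, and function names are assumptions chosen for illustration.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor


def _sleep_and_return(seconds):
    """Backs the user-defined SQL function sleep(seconds)."""
    time.sleep(seconds)
    return seconds


def run_query(sql):
    """Run one query on its own in-memory connection and return one value."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.create_function("sleep", 1, _sleep_and_return)
        return conn.execute(sql).fetchone()[0]
    finally:
        conn.close()


def fast_query_latency(pool_size, slow_queries, sleep_s):
    """Seconds until a fast query completes when queued behind slow ones."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        start = time.monotonic()
        # Slow queries occupy every worker thread in the pool.
        for _ in range(slow_queries):
            pool.submit(run_query, f"SELECT sleep({sleep_s})")
        # The fast query has to wait for a free worker.
        pool.submit(run_query, "SELECT 1").result()
        return time.monotonic() - start
```

With a pool of two workers and four slow queries, `fast_query_latency(2, 4, 0.1)` reports that a trivial `SELECT 1` takes about as long as two rounds of slow queries, which is the symptom thread exhaustion produces for end users.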
Effective Chaos Engineering == Building a Proactive Muscle
Perhaps the most important takeaway from Matt: becoming great at chaos engineering, and building a highly fault-tolerant and resilient product, boils down to developing a proactive process. Exploring what might go wrong before it does will help you catch issues before they become real problems and, equally, will better equip your team to respond to incidents when they do happen.