Protect your business with chaotic testing

May 24, 2024 Roberto Magalhães

Chaotic testing, better known as chaos testing or chaos engineering, deliberately introduces failures into your system to strengthen it. It is an established methodology that large companies use to avoid system-wide outages.

You are migrating your database. The project took months to take shape: hundreds of hours invested in building a compatibility pipeline, dozens of developers manually reviewing the data, weekend meetings to hit the deadline. And then, during the migration, a power surge takes the system down…

Few people would have predicted it, yet this small oversight could sabotage the entire project or, worse yet, cause irreparable damage to your database.

Humans evolved to be cautious about the future; that caution is a biological tool that keeps us alive in the face of new experiences. Even so, we rarely think about the worst-case scenario.

Yet, in the words of Augustus De Morgan, “The first experience already illustrates a truth of theory, well confirmed by practice, everything that can happen will happen if we make enough attempts”. If it sounds vaguely familiar, it's because De Morgan is often credited with one of the earliest formulations of Murphy's law: “Whatever can go wrong, will go wrong.”

Highly unlikely, not impossible

A friend of mine jokingly says, “When you go live, the question isn’t ‘will it explode?’ but ‘what color will the explosion be?’” Jokes aside, there is more than a little truth in his words.

Development environments try to mirror production as closely as possible. Unfortunately, computer systems are sensitive to small differences, so even a minor mismatch between the two can have far-reaching consequences, and that's just the tip of the iceberg.

Systems break, Internet connections have latency, servers crash, hard drives fail, data gets corrupted. These things happen and sometimes we have very little control over when or how. Remember back in 2011 when hundreds of users lost their data due to an AWS failure?

Is it likely to happen again? No. Is it possible? Yes, and that's why engineers design fail-safe systems. If NASA had not designed a fail-safe mechanism, a computer error might have caused Apollo 11 to crash on the Moon.

Apollo program alarm 1202 meant the onboard computer was overloaded with tasks. Fortunately, NASA programmers had anticipated this possibility and built a recovery mechanism that quickly restarted the computer, dropping low-priority work and freeing up memory for new calculations.

Minimizing recovery time

The story of the Moon landing is a prime example of what modern engineers measure as MTTR (mean time to recovery): the average time needed to recover from a failure. If disasters cannot be avoided, the next best thing is to minimize the time it takes to bring systems back up.

Let's put it this way: imagine two competing companies. Company A experiences several system outages throughout the day, while company B experiences a single outage. Without further information, everyone would rather be company B.

But let's say that company A's MTTR (mean time to recovery) is around 20 seconds, while company B's is somewhere between 4 and 6 hours. If company A had 20 outages throughout the day, its total downtime would be about 400 seconds, a little under 7 minutes, against 4 to 6 hours for company B's single outage. Suddenly, the frequency of system interruptions seems much less important.
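The comparison can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `total_downtime_seconds` is made up for illustration, and it assumes every outage takes exactly the mean time to recover, which real outages will not.

```python
def total_downtime_seconds(outages: int, mttr_seconds: float) -> float:
    """Total downtime, assuming each outage lasts exactly the MTTR."""
    return outages * mttr_seconds


# Company A: 20 outages, about 20 seconds each.
company_a = total_downtime_seconds(outages=20, mttr_seconds=20)

# Company B: a single outage lasting between 4 and 6 hours.
company_b_low = total_downtime_seconds(outages=1, mttr_seconds=4 * 3600)
company_b_high = total_downtime_seconds(outages=1, mttr_seconds=6 * 3600)

print(f"Company A: {company_a / 60:.1f} minutes of downtime")
print(f"Company B: {company_b_low / 3600:.0f} to {company_b_high / 3600:.0f} hours of downtime")
print(f"Company B is down at least {company_b_low / company_a:.0f}x longer")
```

Even at the low end, company B's single outage costs 36 times more downtime than company A's twenty.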

How do you minimize recovery time? Well, one of the first things to do is to crash your system on purpose. This is called controlled failure. Although it may seem counterintuitive, when you think about it, it starts to make a lot of sense.

In a controlled failure, you announce the date and time at which the system will fail, but not the failure itself. The team has to diagnose the problem and get the system up and running as quickly as possible.

While this happens, we monitor system data before, during and after the failure. The goal is to aid the recovery effort, but it also provides data for subsequent analysis and improvement.
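A toy version of such a drill might look like the sketch below. Everything in it is hypothetical: `FakeService` stands in for a real system, and the "repair" is triggered after a fixed number of health checks, where a real drill would have people diagnosing and fixing the fault. The point is only to show health samples being recorded before, during and after the failure, along with the measured recovery time.

```python
import time


class FakeService:
    """Hypothetical stand-in for a real system under test."""

    def __init__(self):
        self.healthy = True

    def break_it(self):
        self.healthy = False

    def repair(self):
        self.healthy = True


def run_drill(service, repair_after_checks, poll_interval=0.01):
    """Inject a failure and poll the service until it recovers.

    Returns (samples, recovery_seconds), where samples is a list of
    (timestamp, healthy) pairs taken before, during and after the failure.
    """
    samples = [(time.monotonic(), service.healthy)]  # baseline, before the failure

    service.break_it()
    failed_at = time.monotonic()

    checks = 0
    while not service.healthy:
        samples.append((time.monotonic(), service.healthy))  # during the failure
        checks += 1
        if checks >= repair_after_checks:
            service.repair()  # stands in for the team's actual fix
        time.sleep(poll_interval)

    recovered_at = time.monotonic()
    samples.append((recovered_at, service.healthy))  # after recovery
    return samples, recovered_at - failed_at


samples, recovery_seconds = run_drill(FakeService(), repair_after_checks=5)
print(f"Recorded {len(samples)} health samples")
print(f"Recovered in {recovery_seconds:.3f} seconds")
```

Repeating the drill and averaging the recovery times gives an estimate of MTTR for that kind of failure.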

This type of exercise opens the door to new insights as the team discovers the effects of unexpected system failures. It's a shift in perspective through shock therapy, as suddenly the fail-safe system reveals its fragile self.

With the insights from these exercises, you can create new procedures based on a clearer understanding of the system's flaws. Although the exercise can be stressful, the results are worth it. And frankly, it can be one of the most intellectually challenging exercises for a development team.

Enter the Chaotic Test

Controlled failure exercises are only the starting point, a basic introduction to chaotic testing, also called chaos engineering. At its core, chaotic testing is about building the ability to cause continuous but random failures in your production system.
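To make the idea of "continuous but random failures" concrete, here is a minimal sketch of a fault-injecting wrapper. The names `chaotic`, `InjectedFailure` and `fetch_catalog` are invented for this example and do not come from any real chaos tooling. It also runs in a single process, whereas real chaos experiments target running infrastructure.

```python
import random
import time
from functools import wraps


class InjectedFailure(RuntimeError):
    """Raised when the wrapper decides to simulate a failure."""


def chaotic(failure_rate=0.1, max_delay=0.0, rng=None):
    """Wrap a function so that calls randomly slow down or fail.

    failure_rate: probability (0 to 1) that a call raises InjectedFailure.
    max_delay:    upper bound, in seconds, on the random latency added per call.
    rng:          optional random.Random instance, useful for repeatable runs.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if max_delay > 0:
                time.sleep(rng.uniform(0, max_delay))  # simulated latency
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@chaotic(failure_rate=0.3, rng=random.Random(42))
def fetch_catalog():
    """Hypothetical service call that would normally always succeed."""
    return ["title-1", "title-2"]


failures = 0
for _ in range(1000):
    try:
        fetch_catalog()
    except InjectedFailure:
        failures += 1

print(f"{failures} of 1000 calls failed")
```

The caller never knows which call will fail, so the surrounding code has to handle failure everywhere, and that is the habit the method is meant to build.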

Chaos engineering has been a core strategy for streaming giant Netflix. Its engineering team built a wide variety of “chaos monkeys,” tools that can trigger failures at any minute, from added latency to the simulated loss of an entire Amazon Web Services region.

These chaotic failures, in turn, force your team to switch from defensive development to a more proactive approach. To be more precise, it is a method for building resilience in both the system and the team itself.

A resilient system has the flexibility to adapt to catastrophic circumstances: for example, a streaming platform that reroutes its traffic when a sudden spike in latency delays data transmission.
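The rerouting decision in that example can be sketched in a few lines. This is an illustration under simple assumptions, not how any particular platform does it: the function `pick_route`, the route names and the 200 ms threshold are all made up, and latency is taken as a single measured number per route.

```python
def pick_route(latencies_ms, current, threshold_ms=200.0):
    """Return the route that traffic should use.

    latencies_ms: mapping of route name to its latest measured latency.
    current:      the route in use right now.
    threshold_ms: latency above which the current route counts as degraded.

    Traffic stays on the current route while it is healthy, which avoids
    flapping between routes, and moves to the fastest route otherwise.
    """
    if not latencies_ms:
        raise ValueError("no routes available")
    if latencies_ms.get(current, float("inf")) <= threshold_ms:
        return current
    return min(latencies_ms, key=latencies_ms.get)


# Normal conditions: the current route is within the threshold, so nothing changes.
print(pick_route({"region-a": 50.0, "region-b": 80.0}, current="region-a"))

# A latency spike on region-a pushes traffic to the fastest alternative.
print(pick_route({"region-a": 900.0, "region-b": 80.0}, current="region-a"))
```

A chaotic test that injects latency into one route is a direct way to verify that this switch happens in practice.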

A resilient team is flexible and open-minded, able to adapt quickly and develop new strategies as they face unforeseen problems. Resilient teams tend to view emergencies as opportunities to grow and adapt, rather than fearing them or feeling stressed.

Resilience is something that can be built, both in the system and in team dynamics, and chaotic testing drives this mindset. It's a little like the fire drills we did when we were kids. By simulating a crisis, we get used to it and learn to remain calm when things get out of control.

Keep in mind that this method is extremely demanding on developers and is not recommended for newly formed teams or small-scale projects. It is best suited to projects with many moving parts, where a single bug can have far-reaching consequences.

Chaotic testing is actively recommended as one of the best methods for improving resilience and reducing MTTR, and it is used by software and engineering giants like IBM.

Induce chaos to protect your business

Building with chaos may seem like an oxymoron, but the evidence is hard to deny. Netflix runs one of the most solid systems on the planet, which is a testament to how effective chaotic testing can be when done right.