Chaos Monkey Hiring Guide

Ensure system resilience! Navigate the world of hiring with our Chaos Monkey Hiring Guide, finding experts to test, challenge and strengthen your IT systems.


In most industries, chaos is a negative thing. In the world of chaos engineering, however, chaos is a useful, practical and enlightening tool. Chaos engineering helps developers design computer systems that are more resilient and have fewer weaknesses than those built with traditional testing and engineering alone. Chaos Monkey is the most commonly used tool for creating this “chaos”.

Netflix's engineering team developed Chaos Monkey after migrating its systems to the cloud in 2010. This new cloud-based environment meant that hosts could be shut down and replaced at any time, so the team needed to prepare for such constraints. The engineering team then came up with the idea of testing by randomly rebooting their own hosts. This allowed Netflix to find potential weaknesses while also validating that its remediation automation worked correctly.


Hiring Guide

Netflix designed Chaos Monkey as its own implementation of chaos engineering: it tests system stability by imposing failures on a pseudo-randomly chosen set of services and instances within its cloud architecture. Through this intentionally created chaos, developers and engineers can see how systems respond when critical components of their infrastructure are taken down.

At their core, chaos engineering and Chaos Monkey tell developers how well a system shifts its resources when faced with an interruption. This is especially useful for cloud computing instances running on Amazon Web Services. Chaos Monkey randomly terminates virtual machine instances and containers running within a production environment to expose failures more frequently and help build resilient services.
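To make the idea of random termination concrete, here is a minimal Python sketch of how a tool might choose which instances to terminate. It is an illustration only, not Chaos Monkey's actual selection logic: the function name `pick_victims`, the group and instance names, and the `probability` parameter are all hypothetical.

```python
import random
from typing import Dict, List


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: random.Random,
) -> Dict[str, str]:
    """Choose at most one instance to terminate per group.

    Hypothetical sketch: each non-empty group is hit with the given
    probability, and the victim is drawn uniformly from its instances.
    """
    victims: Dict[str, str] = {}
    for group, instances in sorted(groups.items()):
        if not instances:
            continue  # nothing to terminate in an empty group
        if rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    # Illustrative data; a real tool would query the cloud provider.
    fleet = {"api": ["i-1", "i-2", "i-3"], "web": ["i-4", "i-5"]}
    print(pick_victims(fleet, probability=0.5, rng=random.Random(2024)))
```

Passing in a seeded `random.Random` keeps a run reproducible, which can help when a team wants to replay the same selection while investigating a failure.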

A configurable schedule allows simulated failures to occur at specified times so that developers can monitor them closely. This helps teams prepare for serious, unexpected failures, rather than simply waiting for a catastrophe and reacting after the fact. Chaos engineering generally follows four testing steps:

  1. Engineers and developers define “steady state” as a measurable output of a system to set as a baseline for normal behavior.
  2. Teams then hypothesize that this steady state will continue in both the control and experimental groups during the failure simulation.
  3. Engineers introduce variables to reflect real-world problems and events that could cause catastrophic failures, such as crashes, hard drive malfunctions, severed network connections, and so on.
  4. After witnessing the system's reactions, the team attempts to disprove the hypothesis by looking for differences between the control and experimental groups.
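The four steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: steady state is reduced to a single metric (error rate), and `run_experiment`, `ExperimentResult`, the callables, and the `tolerance` value are hypothetical names chosen for the example rather than part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    control_error_rate: float
    experiment_error_rate: float
    hypothesis_holds: bool


def error_rate(outcomes: List[bool]) -> float:
    """Fraction of failed requests; each outcome is True on success."""
    if not outcomes:
        raise ValueError("no requests observed")
    return outcomes.count(False) / len(outcomes)


def run_experiment(
    observe_control: Callable[[], List[bool]],
    observe_experiment: Callable[[], List[bool]],
    inject_failure: Callable[[], None],
    tolerance: float = 0.02,
) -> ExperimentResult:
    # Step 1: steady state is defined as a measurable output (error rate).
    # Step 2: hypothesis: both groups stay within `tolerance` of each other.
    # Step 3: introduce the failure, affecting the experimental group only.
    inject_failure()
    control = error_rate(observe_control())
    experiment = error_rate(observe_experiment())
    # Step 4: try to disprove the hypothesis by comparing the two groups.
    holds = abs(experiment - control) <= tolerance
    return ExperimentResult(control, experiment, holds)


if __name__ == "__main__":
    result = run_experiment(
        observe_control=lambda: [True] * 100,
        observe_experiment=lambda: [True] * 90 + [False] * 10,
        inject_failure=lambda: None,  # placeholder for a real fault
    )
    print(result)
```

In practice the observation functions would read from a monitoring system, and the tolerance would be chosen from historical variation in the metric.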

Generally, the more difficult it is to disrupt the system's steady state, the more confidence companies and development teams will have in the system in terms of uptime and user experience. The field of chaos engineering, and specifically Chaos Monkey, is still relatively new, but engineers who can test systems and software this way are sought after by larger companies that need to know their systems are fully operational, regardless of the situation or the external factors associated with cloud computing.

Interview Questions

How does an engineer build a hypothesis to test with Chaos Monkey around steady-state behavior?

Focus on the system's measurable output for testing purposes rather than its internal attributes. Overall system error rates, latency percentiles, throughput, and so on are possible metrics of interest in determining steady-state behavior.

Measuring this output over a relatively short period of time also gives an indication of the system's steady state. When working this way, “chaos” verifies that the system works by focusing on systemic behavioral patterns during experiments, rather than validating how it works.
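As an illustration of the metrics mentioned above, the following Python sketch summarizes a short window of requests into an error rate, latency percentiles, and throughput. The `Request` type, the `steady_state` function, and the sample data are assumptions made for the example; a real team would pull these numbers from its monitoring stack.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of values."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Summarize one observation window as a steady-state snapshot."""
    if not requests:
        raise ValueError("no requests observed in the window")
    latencies = [r.latency_ms for r in requests]
    failures = sum(1 for r in requests if not r.ok)
    return {
        "error_rate": failures / len(requests),
        "p50_latency_ms": percentile(latencies, 50),
        "p99_latency_ms": percentile(latencies, 99),
        "throughput_rps": len(requests) / window_seconds,
    }


if __name__ == "__main__":
    # Synthetic window: 100 requests over 10 seconds, every 10th one fails.
    window = [Request(latency_ms=float(i), ok=(i % 10 != 0)) for i in range(1, 101)]
    print(steady_state(window, window_seconds=10.0))
```

Comparing snapshots like this one before and during an experiment is one way to express the hypothesis in terms of observable behavior only.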

Is it advisable to run experiments manually or automatically in Chaos Monkey and chaos engineering?

While manually running experiments helps developers create and witness system reactions, it is labor intensive and ultimately not scalable or sustainable for a team. A better practice in chaos engineering is to automate experiments and run them continuously. Chaos engineering typically incorporates automation into the system to drive both the creation of experiments and the analysis of results.
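A minimal Python sketch of this kind of automation is shown below. It assumes each experiment is a function that returns `True` when its hypothesis holds; the name `run_continuously`, the experiment names, and the interval are hypothetical, and a production setup would more likely rely on a scheduler or pipeline than on a simple loop.

```python
import time
from typing import Callable, Dict, List


def run_continuously(
    experiments: Dict[str, Callable[[], bool]],
    rounds: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run every experiment once per round and collect disproved hypotheses.

    Each experiment returns True when its hypothesis held. The `sleep`
    argument is injectable so the loop can be exercised without waiting.
    """
    failures: List[str] = []
    for round_number in range(1, rounds + 1):
        for name, experiment in experiments.items():
            if not experiment():
                failures.append(f"round {round_number}: {name}")
        if round_number < rounds:
            sleep(interval_seconds)  # wait before the next round
    return failures


if __name__ == "__main__":
    # Placeholder experiments; real ones would inject faults and compare metrics.
    suite = {
        "terminate-one-api-instance": lambda: True,
        "terminate-one-web-instance": lambda: True,
    }
    report = run_continuously(suite, rounds=2, interval_seconds=0.1)
    print(report or "all hypotheses held")
```

The returned list stands in for the automated analysis of results: anything in it is a hypothesis that was disproved and needs a human to follow up.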

What are some of the cons of using Chaos Monkey?

Chaos Monkey is incredibly beneficial, but it has some drawbacks. It requires MySQL 5.X and does not support managed deployments on anything other than Spinnaker. It also offers only a limited scope of testing, as it injects a single failure type: the termination of a random instance. Other kinds of failure, such as the “long-tail” failures experienced during the life cycle of a piece of software, fall outside what it simulates.

Chaos Monkey also lacks a real user interface and must be run through the command line, scripts, and configuration files. Arguably, its biggest disadvantage is that it does not offer recovery capabilities. For that reason, chaos engineering encourages teams to start with the smallest possible experiments to contain the repercussions, and to expand from there while avoiding total system failure.

Job Description

We are looking for an experienced engineer responsible for chaos engineering through the use of Chaos Monkey. This position includes designing and executing chaos and load tests for high-performance systems, software, and applications.

The right candidate will use their knowledge of application frameworks and containerization technologies to design, manage, and maintain programs that stress critical systems and assess their sustainability and reliability. You must be a highly motivated individual, ready to perform rapid testing while working in an agile manner to deliver reliable solutions that meet business needs.

Job Responsibilities

  • Maintain and improve enterprise-wide chaos and reliability testing of technology platforms
  • Run critical path testing at scale across platforms
  • Create performance plans and models for highly scalable, low-latency, and highly available applications and infrastructure systems
  • Actively contribute to capacity planning and disaster recovery preparedness exercises
  • Monitor application performance, optimize performance bottlenecks, and manage usage to create capacity models
  • Partner with development teams to identify and create alternative plans for critical scenarios

Job Qualifications

  • Bachelor's Degree in Computer Science
  • 5+ years of experience in software engineering, including MySQL, Golang, and other relevant programming languages
  • 4+ years of relevant professional experience in chaos engineering
  • In-depth knowledge of Chaos Monkey best practices across multiple domains, including applications, networks, and databases
  • Experience in monitoring strategies, including real users, synthetics, network connections, and so on
