Chaos Engineering is all the rage these days. Many Site Reliability Engineering (SRE) teams have started performing chaos experiments in QA or staging environments, while moving toward limited but expanding testing in production systems. The industry is just beginning to think about the benefits of chaos testing earlier in the process, in the CI/CD pipeline. We expect that to change, because reliability and resiliency need to shift left and start at the beginning.
This article starts with a description of a typical build pipeline, including some definitions for clarity. It then moves to answering questions about why anyone would inject chaos into a build process. Ultimately, the article ends with a high-level discussion of how we can automate well-designed chaos experiments into our CI/CD build process.
Most software development, SRE, and DevOps teams have implemented, or are working to move to, some form of continuous integration and continuous delivery (CI/CD) pipeline. The goal is to automate the software build process to increase velocity. Some even expand on that with continuous deployment to production!
The high-level application development process now goes something like this: developers check out the latest code from a shared repository. Then they make small, atomic changes and check the modified or enhanced code back in. The code check-in triggers an automated build process using a pipeline created with a git hook script or integrated pipeline software like Jenkins, Travis CI, Spinnaker, Bamboo, and the like.
Build pipelines differ depending on their design, but they commonly include stages for automated unit tests, automated deployment to a staging or testing environment, and automated testing in that pre-production environment.
At this point, there is typically a pass/fail notification that is returned for each stage. When all of those stages are successful, there is generally a trigger to deploy to production. That trigger may be automated, especially when it is a canary deployment designed to test one or more limited instances of a new build alongside numerous instances of the previous build, but these triggers are more frequently activated by a human.
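The pass/fail gating described above can be sketched as a small script. This is a minimal illustration only; the stage names and checks are hypothetical placeholders, not any particular pipeline tool's API.

```python
# Minimal sketch of pass/fail stage gating in a build pipeline.
# Stage names and check functions are hypothetical placeholders.

def run_stage(name, check):
    """Run one pipeline stage and report pass/fail."""
    passed = check()
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
    return passed

def pipeline(stages):
    """Run stages in order; the first failure blocks everything after it."""
    for name, check in stages:
        if not run_stage(name, check):
            return False  # a failed stage blocks the deploy trigger
    return True  # all stages passed; safe to trigger deployment

deploy_ready = pipeline([
    ("unit-tests", lambda: True),
    ("deploy-to-stage", lambda: True),
    ("stage-tests", lambda: True),
])
print("trigger production deploy" if deploy_ready else "build failed")
```

In a real pipeline the deploy trigger at the end might be a human approval step rather than an automated call, as noted above.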
See Gremlin’s article How to Safely Manage Change in a CI/CD world for a deeper explanation of CI/CD pipelines. The article includes information about how to implement a CI/CD pipeline safely, how CI/CD is different from older software development processes, and a discussion of the change management evolution happening today. It also discusses the implementation of a deployment pipeline, which some engineering teams are using with great success, especially with cloud-native applications.
This graphic shows where different types of chaos experiments can be used effectively across the various stages of your software build cycle.
Three of the chaos experiment types shown are generally done directly and with human intervention:
- Individual experiments. This is where we have a specific idea of a test design that should give us useful data to help us improve our system’s reliability and perform a one-off experiment. We may end up repeating the experiment later to confirm system enhancements and mitigation, but this is something we do not necessarily plan to test regularly.
- GameDays and FireDrills. Here we schedule a team event with shared responsibilities and a specific focus in mind. These are also typically one-time events, again with the option to repeat later to confirm that work resulting from the GameDay or FireDrill was effective.
- Scheduled experiments. These are experiments that we run in production regularly. They may be scripted and automated or run with human interaction. Either way, they are planned, regularly-occurring events that probe for something specific that we want to remain vigilant over.
It is a fourth type of chaos experiment that we want to focus on here: CI/CD pipeline experiments. These are well-designed, carefully scripted, automated tests that can run in real time during the CI/CD process, triggered at any point in the gauntlet from the moment the build is complete all the way into production.
What we are looking for with these chaos experiments is failure. We might be testing whether a newly implemented or existing problem mitigation works as designed, and continues to work as designed over time as other pieces of the application evolve. For example, is the service reliable in the event of a memory leak or a storage-full error? We designed it to be (perhaps after a production incident helped us understand why this was needed), but without testing, we can't be sure it actually behaves as designed.
Our goal here is to speed up development cycles by finding real issues before they hit production or before they have the opportunity to cause problems in production.
Another use case is to automate the testing of a canary deployment: a newly upgraded instance of a microservice running as a single instance alongside many instances of the previous service release in our cloud infrastructure. Maybe the older service is being updated for a bug fix or feature upgrade, and you are using canary deployments as part of your software life cycle to get some testing time in production before switching to the new release universally. Use a chaos experiment to hammer the canary service to the point where the old one had problems, and see if it holds up before you start replacing all instances of the service.
We can run these tests in development. We can run them in a testing or stage deployment. We can run them alongside a QA process. We can run them in production. The most important thing is to design good chaos experiments and run them. Start small and build on successes incrementally as you gain confidence.
What we expect is to learn with greater certainty where our software works reliably, because we have tested it and confirmed it. We also expect to learn where we have weaknesses that we can shore up. We expect to learn this without the costly downtime of customer-impacting production outage events.
To inject Gremlin chaos attacks into a CI/CD pipeline, we use the Gremlin API. This gives us an easy way to include attacks into our pipeline scripts.
What makes it easy is that once your Gremlin account is set up, you can obtain the authorization token for the API from the Gremlin website, and find curl examples right there that you can copy, paste, and use to run specific experiments. You can even design experiments in the UI and retrieve the details needed to unleash the same attack via the API, all in the same place. It is quick and painless.
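As a rough illustration, an attack request can be assembled in a pipeline script before it is posted. This is a hedged sketch only: the endpoint path, header format, and payload shape below are assumptions modeled on the copy/paste examples the Gremlin UI provides, so confirm the exact schema against your own account before using it.

```python
import json

# ASSUMPTION: endpoint and payload shape are illustrative, not the
# verified Gremlin API schema; check the examples in your Gremlin account.
GREMLIN_API = "https://api.gremlin.com/v1/attacks/new"

def build_attack_request(team_key, attack_type, args, target_tags):
    """Assemble headers and a JSON body for a hypothetical attack request."""
    headers = {
        "Authorization": f"Key {team_key}",  # token from the Gremlin website
        "Content-Type": "application/json",
    }
    body = {
        "command": {"type": attack_type, "args": args},
        "target": {"type": "Random", "tags": target_tags},
    }
    return headers, json.dumps(body)

# Example: a 60-second, 100ms latency attack on hosts tagged service=checkout.
headers, body = build_attack_request(
    "YOUR-TEAM-KEY", "latency", ["-l", "60", "-m", "100"],
    {"service": "checkout"},
)
```

From here, the same headers and body would be handed to curl or an HTTP client inside the pipeline step.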
Here are the high-level steps for injecting Gremlin attacks into a CI/CD pipeline:
- Use curl to post attacks.
- While the attack is running, block and poll both our observability tools (such as Datadog or New Relic) and the Gremlin API for the current state of our environment and state of the attack being run.
- If the observability tool’s API returns results that exceed our predefined threshold, we fail the build. If the attack enters into one of the failure modes (such as lost communication), we fail the build.
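The three steps above can be sketched as a single gate function. Everything here is a hedged sketch: the attack state names and the stand-in functions for starting the attack, polling the Gremlin API, and querying the observability tool (Datadog, New Relic, etc.) are assumptions, not real API calls.

```python
import time

LATENCY_THRESHOLD_MS = 500  # example predefined threshold for the build gate

def run_attack_and_gate(start_attack, get_attack_state, get_p99_latency_ms,
                        poll_seconds=1, timeout=300):
    """Start an attack, then block and poll until a pass/fail decision.

    start_attack / get_attack_state stand in for Gremlin API calls;
    get_p99_latency_ms stands in for an observability tool's API.
    State names below are illustrative. Returns True if the build passes.
    """
    start_attack()
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = get_attack_state()
        if get_p99_latency_ms() > LATENCY_THRESHOLD_MS:
            return False  # metrics exceeded the threshold: fail the build
        if state == "LostCommunication":
            return False  # attack entered a failure mode: fail the build
        if state == "Successful":
            return True   # attack finished cleanly and metrics held: pass
        time.sleep(poll_seconds)
    return False          # attack never completed in time: fail the build
```

A pipeline step would call this with real API clients and fail the job on a `False` return (for example, by exiting nonzero).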
It is also possible to build a standalone Java JAR file that makes these API calls. That JAR can be included in any CI/CD pipeline as long as a JVM is available.
Gremlin already has a native integration with Spinnaker, but doing similar work with other automation tools like Ansible, Chef, Puppet, and others will elicit similar results and can be done using direct API calls with CURL.
Getting started with Chaos Engineering takes time, especially in a CI/CD context. New things require effort and assistance. To help out, we are sharing some questions that our Customer Experience, Customer Success, Solutions Architects, and Customer Support experts have received from customers along with our answers.
Since these failure tests run at runtime, we need to tie the attack results to monitoring metrics and events. If a build is just as resilient as it was previously, no red flags should be raised and thus no signals. However, if a latency spike is detected, the monitoring tool should return a "build needs attention" flag of some sort.
In the UI, you manually kick off an experiment, run manual functional tests, and observe graphs and alerts. For CI/CD use, all these steps need to be replicated using automation. For the Chaos Engineering attack, you can use the Gremlin API; for the functional tests you can use Selenium; for the observability, you will use your observability tool’s API. All this can be scripted in a build job or put into a pipeline step.
It most likely is. Platforms don't change nearly as fast as application builds, but platforms should also have a set of tests run per release if they are built through a continuous integration pipeline as well.
This really depends on your build details. In general, however, these are pretty good experiments to start with:
- Reinforce scaling policies, likely by using resource attacks to confirm failover designs and mitigation schemes either work or are needed
- Create network blips, such as injecting 100ms latency or 1-2% packet loss, and observe how or if your system is affected
- Perform a shutdown of a node or service to test scaling or recovery mechanisms, though this assumes that the test build runs on a cluster similar to production
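The starter experiments above could be encoded as a small table of attack specs that a pipeline iterates over. The values come from the list above, but the dict layout itself is a hypothetical illustration, not a real Gremlin schema.

```python
# Starter experiment specs from the list above. The layout is a
# hypothetical illustration, not a real Gremlin API schema.
STARTER_EXPERIMENTS = [
    {"name": "resource-pressure", "type": "resource",
     "goal": "confirm scaling policies, failover designs, and mitigations"},
    {"name": "network-blip-latency", "type": "latency",
     "args": {"latency_ms": 100},
     "goal": "observe how or if the system is affected"},
    {"name": "network-blip-loss", "type": "packet_loss",
     "args": {"percent": 2},
     "goal": "observe how or if the system is affected"},
    {"name": "node-shutdown", "type": "shutdown",
     "goal": "test scaling or recovery on a production-like cluster"},
]

def experiments_of_type(kind):
    """Select the starter experiments of a given attack type."""
    return [e["name"] for e in STARTER_EXPERIMENTS if e["type"] == kind]
```

Keeping the specs as data like this makes it easy to start with one experiment and add the rest incrementally as confidence grows.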
To give a great answer, we need to delve into the specific details of the microservice that will be simulated to fail and also the downstream microservice that depends on it.
In general, adding that dependency into the monitoring dashboard is going to be vital, however, if it's a service into which you don't have a lot of observability, it'll be harder.
This completely depends on the specific project. Individual experiments are great to help create and test dashboards and observability, validate fixes, and decide on tests for a greater GameDay.
GameDays are great for a wider audience and tests with a larger blast radius, for running FireDrills, and as mentioned above, testing that dependency effect since you'll likely have those folks in the room at the same time.
Experiments not intended for inclusion in the CI/CD pipeline include relatively small-scale tests, as well as tests that reinforce or confirm behavior after deployments reach production, well past your CI/CD pipeline. Many of these could and should still be automated, but it is fine to run them manually, especially if you stick to doing so regularly.
The reason for testing later in the process is that while tests in CI/CD are great for exercising a single service, the overall ecosystem is constantly changing. Scheduled tests stress the whole environment beyond any single service change, so there is great value in testing and reinforcing things everywhere.
Once you get past the implementation process, there is a strong argument that running these tests will ultimately speed up the overall deployment cadence, because fewer rollbacks will be needed for bad builds.
Using scheduler prioritization helps. Something that might be best in this situation is to do this type of testing closer to the last stages of the CD pipeline, like during canary testing.
In a canary release, a small percentage of production traffic is routed to the new build (generally something like 1%), and that build is measured against the current build in production. Traffic from the canary build then gets routed back to the rest of prod.
If we run tests in the canary, it wouldn't matter if other builds are also running tests on their traffic (assuming they also do canary testing), because the overlap is so small that it won't affect our test results: only 1% of 1% of traffic would be subject to concurrent chaos experiments.
What are some examples of experiments to run if, say, an app is required to be highly available (HA) active/active?
You could run a bunch of resource attacks, some latency, maybe all the way up to a zone outage. This is likely outside of CI/CD though. For example, Chaos Kong (the Netflix region failover) was never done in an automated way.
Chaos Engineering is the next step in the evolution of our industry’s growing focus on reliability. Learning how to effectively test the resilience of our systems under a variety of stresses and loads, from the moment a build is complete all the way through production, is what will differentiate the best from the rest.
We won’t get there overnight and don’t expect to. Starting small is always better than not starting at all and gives us the opportunity to do greater things later. Incremental improvement is always better than no progress at all. Keep that in mind as you continue to evolve your CI/CD pipeline.
Tammy Butow, Principal Site Reliability Engineer