March 11, 2019

Migrating to the Cloud Is Chaotic. Embrace It.

Why organizations planning to migrate to the cloud should embrace Chaos Engineering as a thoughtful strategy to avoid pain down the road.

Migrating to the cloud is an intimidating prospect and understandably so – there is a lot that will change in your systems as you move from on-prem to the cloud, and these changes can mean instability in your systems.


How can you ensure your software will be safe after migrating to the cloud? How do you combat the cloud's chaotic nature while providing a resilient and stable system? By intentionally inducing Chaos well before migration begins.


It sounds counter-intuitive to perform Chaos Engineering while your team is actively migrating to the cloud. Wouldn't that add failure and slow down an already challenging process? In practice, Chaos Engineering is a great way to test how your new system will behave once you switch traffic over. By performing Chaos Experiments on the environment you are migrating into, you will identify previously unknown weaknesses while you still have time to mitigate them.

This blog post will discuss a number of ways that things can go wrong and provide tutorials to run Chaos Experiments to proactively identify potential issues before they turn into production outages.

Managing Heavy CPU Load

An overloaded CPU can quickly create bottlenecks and cause failures within most architectures. In a distributed cloud environment, instability in a single system can quickly cascade into problems elsewhere down the chain. Proper CPU resilience testing helps determine which existing systems are already resilient to an overloaded CPU, and which should be prioritized for upgrade or migration to maintain a stable stack.

Performing a CPU Attack with Gremlin

A Gremlin CPU Attack consumes 100% of the specified CPU cores on the target system. The CPU Attack is a great way to test the stability of the targeted machine -- along with its critical dependencies -- when the CPU is overloaded.

Prerequisites

A CPU Attack accepts the following arguments.

Short Flag  Long Flag  Purpose
-c          --cores    Number of CPU cores to attack.
-l          --length   Attack duration (in seconds).

Most Gremlin API calls accept a JSON body payload, which specifies critical arguments. In all the following examples you'll be creating a local attacks/<attack-name>.json file to store the API attack arguments. You'll then pass those arguments along to the API request.

  1. On your local machine, start by creating the attacks/cpu.json file and paste the following JSON into it. This will attack a single core for 30 seconds.

    {
       "command": {
           "type": "cpu",
           "args": ["-c", "1", "-l", "30"]
       },
       "target": {
           "type": "Random"
       }
    }
  2. Create the new Attack by passing the JSON from attacks/cpu.json to the https://api.gremlin.com/v1/attacks/new API endpoint.

    curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/cpu.json"
  3. On the targeted machine you'll see that one CPU core is maxed out.

    htop


  4. You can also create, run, and view the Attack on the Gremlin Web UI.
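
The create-and-send steps above can be wrapped in a small shell helper. This is only a sketch: the `run_attack` function name is hypothetical and not part of Gremlin's tooling, while the endpoint, headers, and attacks/<attack-name>.json layout are the ones used in the steps above. It assumes GREMLIN_API_TOKEN is already exported.

```shell
# Hypothetical helper: post attacks/<name>.json to the endpoint used above.
# Assumes GREMLIN_API_TOKEN is exported, as in the curl example.
run_attack() {
    name=$1
    payload="attacks/${name}.json"

    # Fail early rather than sending an empty request body.
    if [ ! -f "$payload" ]; then
        echo "missing payload file: $payload" >&2
        return 1
    fi
    if [ -z "${GREMLIN_API_TOKEN:-}" ]; then
        echo "GREMLIN_API_TOKEN is not set" >&2
        return 1
    fi

    curl -H "Content-Type: application/json" \
         -H "Authorization: $GREMLIN_API_TOKEN" \
         https://api.gremlin.com/v1/attacks/new \
         -d "@${payload}"
}
```

With this in place, `run_attack cpu` would send attacks/cpu.json, and the same call shape works for the other attack files created later in this post.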

If you wish to attack a specific Client, change the "type" value under "target" to "Exact" and add an "exact" field under "target" containing a list of target Clients. Gremlin identifies a Client by the GREMLIN_IDENTIFIER of the instance, which can also be set in a local environment variable when running the gremlin init command.

{
    "command": {
        "type": "cpu",
        "args": ["-c", "1", "-l", "30"]
    },
    "target": {
        "type": "Exact",
        "exact": ["aws-nginx"]
    }
}
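
Since the Random and Exact payloads differ only in the target block, the file can be generated rather than edited by hand. The sketch below uses a hypothetical `attack_json` function (not a Gremlin command) that prints a payload in the shape shown above. It assumes the values contain no double quotes or backslashes, because it performs no JSON escaping.

```shell
# Hypothetical helper: print an attack payload in the shape shown above.
# Usage: attack_json <attack-type> <client-or-empty> <arg>...
# An empty client yields a "Random" target; otherwise an "Exact" target.
# Assumes values contain no double quotes or backslashes (no JSON escaping).
attack_json() {
    kind=$1
    client=$2
    shift 2

    # Join the remaining arguments into a JSON array of strings.
    args=""
    for a in "$@"; do
        args="${args:+$args, }\"$a\""
    done

    if [ -n "$client" ]; then
        target="\"type\": \"Exact\", \"exact\": [\"$client\"]"
    else
        target="\"type\": \"Random\""
    fi

    printf '{"command": {"type": "%s", "args": [%s]}, "target": {%s}}\n' \
        "$kind" "$args" "$target"
}
```

For example, `attack_json cpu aws-nginx -c 1 -l 30 > attacks/cpu.json` would write the Exact payload above on a single line, and passing an empty string as the second argument would produce the Random variant.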

Handling Storage Disk Limitations

Migrating to a new system frequently requires moving volumes across disks and to other cloud-based storage layers. It is vital to determine whether your new storage system can handle the increase in volume that the migration will require. You will also want to test how the system reacts when volumes become overburdened or unavailable.

Performing a Disk Attack with Gremlin

Gremlin's Disk Attack rapidly consumes disk space on the targeted machine, allowing you to test the resiliency of that machine and other related systems when unexpected disk failures occur.

Prerequisites

A Gremlin API Disk Attack accepts the following arguments.

Short Flag  Long Flag     Purpose
-b          --block-size  The size (in kilobytes) of the blocks that are written.
-d          --dir         The directory that temporary files will be written to.
-l          --length      Attack duration (in seconds).
-p          --percent     The percentage of the volume to fill.
-w          --workers     The number of disk-write workers to run concurrently.

  1. On your local machine, start by creating the attacks/disk.json file and paste the following JSON into it. Be sure to change your target Client. This attack will fill 95% of the volume over the course of a 60-second attack using 2 workers.

    {
       "command": {
           "type": "disk",
           "args": ["-d", "/tmp", "-l", "60", "-w", "2", "-b", "4", "-p", "95"]
       },
       "target": {
           "type": "Exact",
           "exact": ["aws-nginx"]
       }
    }
  2. (Optional) Check the current disk usage on the target machine.

    df -H
    # OUTPUT
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda1      8.3G  1.4G  6.9G  17% /
  3. Create the new Disk Attack by passing the JSON from attacks/disk.json to the https://api.gremlin.com/v1/attacks/new API endpoint.

    curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/disk.json"
  4. Check the attack target's current disk space, which will soon reach the specified percentage before Gremlin rolls back and returns the disk to the original state.

    df -H
    # OUTPUT
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda1      8.3G  7.9G  396M  96% /
  5. You can also create, run, and view the Attack on the Gremlin Web UI.
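
Rather than running df by hand before and after, you can sample disk usage while the attack is in progress. The following is a minimal sketch to run on the target machine; the `disk_use_percent` and `watch_disk` function names are hypothetical. It relies on the POSIX `df -P` output format, in which the fifth column is the usage percentage.

```shell
# Print the Use% column (without the % sign) for the filesystem holding a path.
# Uses POSIX `df -P` so each filesystem is reported on a single line.
disk_use_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical watcher: print a timestamped sample once per second.
# Usage: watch_disk <path> <seconds>
watch_disk() {
    path=$1
    seconds=$2
    i=0
    while [ "$i" -lt "$seconds" ]; do
        echo "$(date +%T) $(disk_use_percent "$path")%"
        sleep 1
        i=$((i + 1))
    done
}
```

Running `watch_disk /tmp 60` alongside the attack above should show usage climbing toward the requested percentage and then dropping once Gremlin rolls back.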

Evaluating Network Resiliency

Network problems are a common cause of service outages. Even architectures designed with network redundancies can experience multiple, cumulative network failures. Moreover, most modern software relies on external networks to some degree, which means a network outage completely outside of your control could cause a failure to propagate throughout your system.

Performing a Black Hole Attack with Gremlin

A Black Hole Attack temporarily drops all traffic based on the parameters of the attack. You can use a Black Hole Attack to test routing protocols, loss of communication to specific hosts, port-based traffic, network device failure, and much more.

Prerequisites

A Gremlin API Black Hole Attack accepts the following arguments.

Short Flag  Long Flag       Purpose
-d          --device        Network device through which traffic should be affected. Defaults to the first device found.
-h          --hostname      Outgoing hostnames to affect. Optionally, you can prefix a hostname with a caret (^) to whitelist it. It is recommended to include ^api.gremlin.com in the whitelist.
-i          --ipaddress     Outgoing IP addresses to affect. Optionally, you can prefix an IP with a caret (^) to whitelist it.
-l          --length        Attack duration (in seconds).
-n          --ingress_port  Only affect ingress traffic to these destination ports. Ranges can also be specified (e.g. 8080-8085).
-p          --egress_port   Only affect egress traffic to these destination ports. Ranges can also be specified (e.g. 8080-8085).
-P          --ipprotocol    Only affect traffic using this protocol.

  1. Start by performing a test to establish a baseline. The following command tests the response time of a request to example.com (which has an IP address of 93.184.216.34).

    $ time curl -o /dev/null 93.184.216.34
    
    # OUTPUT
    real    0m0.025s
    user    0m0.009s
    sys     0m0.000s
  2. On your local machine, create the attacks/blackhole.json file and paste the following JSON into it. Set your target Client as necessary. This attack creates a 30-second black hole that drops traffic to the 93.184.216.34 IP address.

    {
       "command": {
           "type": "blackhole",
           "args": ["-l", "30", "-i", "93.184.216.34", "-h", "^api.gremlin.com"]
       },
       "target": {
           "type": "Exact",
           "exact": ["aws-nginx"]
       }
    }
  3. Execute the Black Hole Attack by passing the JSON from attacks/blackhole.json to the https://api.gremlin.com/v1/attacks/new API endpoint.

    curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/blackhole.json"
  4. On the target machine run the same timed curl test as before. It now hangs for approximately 30 seconds until the black hole has been terminated and a response is finally received.

    $ time curl -o /dev/null 93.184.216.34
    
    # OUTPUT
    real    0m31.623s
    user    0m0.013s
    sys     0m0.000s
  5. You can also create, run, and view the Attack on the Gremlin Web UI.
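
The timed curl above hangs for the whole attack because no timeout is set. A useful follow-up experiment is to check that a client with a time limit fails fast instead. The sketch below is one way to do that; the `probe` function name is hypothetical, while `--max-time` and exit status 28 (operation timed out) are standard curl behavior.

```shell
# Probe a URL with a hard time limit and classify the result.
# --max-time bounds the whole transfer; curl exits with status 28 on timeout.
# Usage: probe <url> [limit-in-seconds]
probe() {
    url=$1
    limit=${2:-5}

    status=0
    curl -s -o /dev/null --max-time "$limit" "$url" || status=$?

    case $status in
        0)  echo "ok" ;;
        28) echo "timeout" ;;
        *)  echo "error ($status)" ;;
    esac
}
```

During the black hole, `probe 93.184.216.34 5` would be expected to print `timeout` after about five seconds rather than blocking for the full 30, which is the behavior you typically want from services that depend on the affected host.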

Proper Memory Management

While most cloud platforms provide auto-balancing and scaling services, it is unwise to rely solely on these technologies and assume they alone will keep your system stable and responsive. Memory management is a crucial part of maintaining a healthy and inexpensive cloud stack. An improper configuration or poorly tested system may not necessarily cause a system failure or outage, but even a tiny memory issue can add up to thousands of dollars in extra support costs.

Performing Chaos Engineering before, during, and after cloud migration lets you test system failures when instances, containers, or nodes run out of memory. This testing helps you keep your stack active and functional when an unexpected memory leak occurs.

Performing a Memory Attack with Gremlin

A Gremlin Memory Attack consumes memory on the targeted machine, making it easy to test how that system and other dependencies behave when memory is unavailable.

Prerequisites

A Gremlin API Memory Attack accepts the following arguments.

Short Flag  Long Flag    Purpose
-g          --gigabytes  The amount of memory (in GB) to allocate.
-l          --length     Attack duration (in seconds).
-m          --megabytes  The amount of memory (in MB) to allocate.

  1. (Optional) On the target machine check the current memory usage to establish a baseline prior to executing the attack.

    htop


  2. On your local machine create an attacks/memory.json file and paste the following JSON into it, ensuring you change your target Client. This attack will consume up to 0.75 GB of memory for a total of 30 seconds.

    {
       "command": {
           "type": "memory",
           "args": ["-l", "30", "-g", "0.75"]
       },
       "target": {
           "type": "Exact",
           "exact": ["aws-nginx"]
       }
    }
  3. Launch the Memory Attack by passing the JSON from attacks/memory.json to the https://api.gremlin.com/v1/attacks/new API endpoint.

    curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/memory.json"
  4. That additional memory is now consumed on the target machine.

    htop


  5. As always, you can view the Attack within the Gremlin Web UI.
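
If you would rather record a number than read it off htop, the before-and-after comparison can be scripted. This sketch assumes a Linux target that exposes MemAvailable in /proc/meminfo; the `mem_available_mb` function name is hypothetical.

```shell
# Print MemAvailable in MB from a meminfo-style file (default: /proc/meminfo).
# Linux-specific; the kernel reports the value in kB.
# Usage: mem_available_mb [file]
mem_available_mb() {
    awk '$1 == "MemAvailable:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}
```

Calling `mem_available_mb` once before the attack and again while it is running should show available memory dropping by roughly the amount requested in the attack arguments.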

Troubleshooting I/O Bottlenecks

Due to the proliferation of automatic monitoring and elastic scaling, I/O failure may seem like an unlikely problem within a cloud architecture. However, even when I/O failure isn't the root cause of an outage, it is often a symptom of another issue, and it can trigger a negative cascading effect throughout other dependent systems. Because I/O failure is considered an unlikely event, it is often overlooked as a test subject. It should not be.

Performing an I/O Attack with Gremlin

Gremlin's IO Attack performs rapid read and/or write actions on the targeted system volume.

Prerequisites

A Gremlin API IO Attack accepts the following arguments.

Short Flag  Long Flag      Purpose
-c          --block-count  The number of blocks read or written by workers.
-d          --dir          The directory that temporary files will be written to.
-l          --length       Attack duration (in seconds).
-m          --mode         Specifies if workers are in read (r), write (w), or read+write (rw) mode.
-s          --block-size   Size of blocks (in KB) that are read or written by workers.
-w          --workers      The number of concurrent workers.

  1. On your local machine create an attacks/io.json file and paste the following JSON into it. Change the target Client as necessary. This IO Attack creates two workers that will perform both reads and writes during the 45-second attack.

    {
       "command": {
           "type": "io",
           "args": ["-l", "45", "-d", "/tmp", "-w", "2", "-m", "rw", "-s", "4", "-c", "1"]
       },
       "target": {
           "type": "Exact",
           "exact": ["aws-nginx"]
       }
    }
  2. Launch the IO Attack by passing the JSON from attacks/io.json to the https://api.gremlin.com/v1/attacks/new API endpoint.

    curl -H "Content-Type: application/json" -H "Authorization: $GREMLIN_API_TOKEN" https://api.gremlin.com/v1/attacks/new -d "@attacks/io.json"
  3. On the target machine verify that the attack is running and that I/O is currently overloaded.

    $ sudo iotop -aoP
    # OUTPUT
    Total DISK READ :       0.00 B/s | Total DISK WRITE :       3.92 M/s
    Actual DISK READ:       0.00 B/s | Actual DISK WRITE:      15.77 M/s
    PID  PRIO  USER       DISK READ  DISK WRITE  SWAPIN     IO>     COMMAND
    323   be/3 root          0.00 B     68.00 K  0.00 % 71.28 %   [jbd2/xvda1-8]
    20030 be/4 gremlin       0.00 B    112.15 M  0.00 % 17.11 %   gremlin attack io -l 45 -d /tmp -w 2 -m rw -s 4 -c 1
  4. You can also create, run, and view the Attack on the Gremlin Web UI.
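
iotop shows that the attack is generating load, but not what that load does to other writers on the same volume. One rough way to see the impact is to time a small synced write before and during the attack. The sketch below is an assumption-laden example: the `write_probe` function name is hypothetical, `conv=fsync` is supported by GNU and BusyBox dd but may not exist elsewhere, and the timing only has whole-second resolution.

```shell
# Write <count> blocks of 4096 bytes, force them to disk, and print the
# elapsed time in whole seconds. The probe file is removed afterwards.
# Usage: write_probe [dir] [count]
write_probe() {
    dir=${1:-/tmp}
    count=${2:-256}
    file="$dir/io-probe.$$"

    start=$(date +%s)
    status=0
    dd if=/dev/zero of="$file" bs=4096 count="$count" conv=fsync 2>/dev/null || status=$?
    end=$(date +%s)
    rm -f "$file"

    # Propagate a dd failure (for example, a full or read-only volume).
    [ "$status" -eq 0 ] || return "$status"
    echo $((end - start))
}
```

Comparing `write_probe /tmp 25600` (about 100 MiB) on an idle machine with the same call during the IO Attack gives a coarse sense of how much slower ordinary writes become. Adjust the count to suit the speed of the volume under test.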

What Comes Next?

This article explored a number of common issues and outages related to failed migrations and upgrade procedures. As impactful and expensive as those outages may have been, their existence should not dissuade you from making the move to the cloud. A distributed architecture allows you to enjoy faster release cycles and, in general, increased developer productivity.

Instead, the occurrence of migration issues for even the biggest organizations in the industry illustrates the necessity of proper resilience testing. Chaos Engineering is a critical piece of that finished and fully-resilient puzzle. Planning ahead and running Chaos Experiments on your systems, both prior to and during migration, will help ensure you are creating the most stable, robust, and resilient system possible.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. Use Gremlin for Free and see how you can harness chaos to build resilient systems.
