16 Apr 2016, 20:00

Pumba - Chaos Testing for Docker

Update (27-07-27): post updated for the latest Pumba version, v0.2.0.


The best defense against unexpected failures is to build resilient services. Testing for resiliency enables teams to learn where their apps fail before the customer does. By intentionally causing failures as part of resiliency testing, you can enforce your policy for building resilient systems. Resilience is a system's ability to continue functioning even while some of its components are failing. The growing popularity of distributed and microservice architectures makes resilience testing critical for applications that now require 24x7x365 operation. Resilience testing is an approach where you intentionally inject different types of failures at the infrastructure level (VM, network, containers, and processes) and let the system try to recover, just as it would from unexpected failures in production. Simulating realistic failures at any time is the best way to enforce highly available and resilient systems.

What is Pumba?


First of all, Pumba (or Pumbaa) is a supporting character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”. This reflects the unexpected behavior of the application.

Pumba is inspired by the highly popular Netflix Chaos Monkey, a resilience testing tool for the AWS cloud. Pumba takes a similar approach but applies it at the container level: it connects to the Docker daemon running on some machine (local or remote) and brings a level of chaos to it, "randomly" killing, stopping, and removing running containers.

If your system is designed to be resilient, it should be able to recover from such failures. "Failed" services should be restarted and lost connections should be recovered. This is not as trivial as it sounds. You need to design your services differently: be aware that a service can fail (for whatever reason), or that a service it depends on can disappear at any point in time (but may reappear later). Expect the unexpected!

Why run Pumba?

Failures happen, and they inevitably happen when least desired. If your application cannot recover from system failures, you are going to face angry customers and maybe even lose them. If you want to be sure that your system is able to recover from unexpected failures, it is better to take charge and inject failures yourself instead of waiting until they happen. And this is not a one-time effort: in the age of Continuous Delivery, you need to be sure that every change to any system service does not compromise system availability. That's why you should practice continuous resilience testing. Docker is gaining popularity, and people are deploying and running clusters of containers in production. Using a container orchestration framework (e.g. Kubernetes, Swarm, CoreOS fleet), it's possible to restart a "failed" container automatically, but how can you be sure that the restarted service and the other system services can properly recover from such failures? And if you are not using a container orchestration framework, life is even harder: you will need to handle container restarts yourself.

This is where Pumba shines. You can run it on every Docker host, in your cluster, and Pumba will “randomly” stop running containers - matching specified name/s or name patterns. You can even specify the signal that will be sent to “kill” the container.

What can Pumba do?

Pumba can create different failures for your running Docker containers: it can kill, stop, or remove them. It can also pause all processes within a running container for a specified period of time. In addition, Pumba can perform network emulation, simulating different network failures such as delay, packet loss/corruption/reorder, bandwidth limits, and more. Disclaimer: the netem command is still under development, and only the delay command is supported in Pumba v0.2.0.

You can pass a list of containers to Pumba or write a regular expression to select matching containers. If you do not specify any containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list.

How to run Pumba?

There are two ways to run Pumba.

First, you can download the Pumba application (a single binary file) for your OS from the project release page and run pumba help to see the list of supported commands and options.

$ pumba help

Pumba version v0.2.0
   Pumba - Pumba is a resilience testing tool, that helps applications tolerate random Docker container failures: process, network and performance.

   pumba [global options] command [command options] containers (name, list of names, RE2 regex)


     kill     kill specified containers
     netem    emulate the properties of wide area networks
     pause    pause all processes
     stop     stop containers
     rm       remove containers
     help, h  Shows a list of commands or help for one command

   --host value, -H value      daemon socket to connect to (default: "unix:///var/run/docker.sock") [$DOCKER_HOST]
   --tls                       use TLS; implied by --tlsverify
   --tlsverify                 use TLS and verify the remote [$DOCKER_TLS_VERIFY]
   --tlscacert value           trust certs signed only by this CA (default: "/etc/ssl/docker/ca.pem")
   --tlscert value             client certificate for TLS authentication (default: "/etc/ssl/docker/cert.pem")
   --tlskey value              client key for TLS authentication (default: "/etc/ssl/docker/key.pem")
   --debug                     enable debug mode with verbose logging
   --json                      produce log in JSON format: Logstash and Splunk friendly
   --slackhook value           web hook url; send Pumba log events to Slack
   --slackchannel value        Slack channel (default #pumba) (default: "#pumba")
   --interval value, -i value  recurrent interval for chaos command; use with optional unit suffix: 'ms/s/m/h'
   --random, -r                randomly select single matching container from list of target containers
   --dry                       dry run; does not create chaos, only logs planned chaos commands
   --help, -h                  show help
   --version, -v               print the version

Kill Container command

$ pumba kill -h

   pumba kill - kill specified containers

   pumba kill [command options] containers (name, list of names, RE2 regex)

   send termination signal to the main process inside target container(s)

   --signal value, -s value  termination signal, that will be sent by Pumba to the main process inside target container(s) (default: "SIGKILL")

Pause Container command

$ pumba pause -h

   pumba pause - pause all processes

   pumba pause [command options] containers (name, list of names, RE2 regex)

   pause all running processes within target containers

   --duration value, -d value  pause duration: should be smaller than recurrent interval; use with optional unit suffix: 'ms/s/m/h'

Stop Container command

$ pumba stop -h
   pumba stop - stop containers

   pumba stop [command options] containers (name, list of names, RE2 regex)

   stop the main process inside target containers, sending SIGTERM, and then SIGKILL after a grace period

   --time value, -t value  seconds to wait for stop before killing container (default: 10)

Remove (rm) Container command

$ pumba rm -h

   pumba rm - remove containers

   pumba rm [command options] containers (name, list of names, RE2 regex)

   remove target containers, with links and volumes

   --force, -f    force the removal of a running container (with SIGKILL)
   --links, -l    remove container links
   --volumes, -v  remove volumes associated with the container

Network Emulation (netem) command

$ pumba netem -h

   Pumba netem - delay, loss, duplicate and re-order (run 'netem') packets, to emulate different network problems

   Pumba netem command [command options] [arguments...]

      delay      delay egress traffic
     loss       TODO: planned to implement ...
     duplicate  TODO: planned to implement ...
     corrupt    TODO: planned to implement ...

   --duration value, -d value   network emulation duration; should be smaller than recurrent interval; use with optional unit suffix: 'ms/s/m/h'
   --interface value, -i value  network interface to apply delay on (default: "eth0")
   --target value, -t value     target IP filter; netem will impact only on traffic to target IP
   --help, -h                   show help

Network Emulation Delay sub-command

$ pumba netem delay -h

   Pumba netem delay - delay egress traffic

   Pumba netem delay [command options] containers (name, list of names, RE2 regex)

   delay egress traffic for specified containers; real networks show variability, so it is possible to add random variation; delay variation isn't purely random, so a correlation value can be set to emulate that

   --amount value, -a value       delay amount; in milliseconds (default: 100)
   --variation value, -v value    random delay variation; in milliseconds; example: 100ms ± 10ms (default: 10)
   --correlation value, -c value  delay correlation; in percents (default: 20)


# stop a random container once every 10 minutes
$ ./pumba --random --interval 10m kill --signal SIGSTOP
# every 15 minutes kill the `mysql` container and every hour remove containers whose names start with "hp"
$ ./pumba --interval 15m kill --signal SIGTERM mysql &
$ ./pumba --interval 1h rm re2:^hp &
# every 30 seconds kill "worker1" and "worker2" containers and every 3 minutes stop "queue" container
$ ./pumba --interval 30s kill --signal SIGKILL worker1 worker2 &
$ ./pumba --interval 3m stop queue &
# Once every 5 minutes, Pumba will delay for 2 seconds (2000ms) egress traffic for some (randomly chosen) container,
# named `result...` (matching `^result` regexp) on `eth2` network interface.
# Pumba will restore normal connectivity after 2 minutes. Print debug trace to STDOUT too.
$ ./pumba --debug --interval 5m --random netem --duration 2m --interface eth2 delay --amount 2000 re2:^result

Running Pumba in Docker Container

The second approach is to run Pumba inside a Docker container.

In order to give Pumba access to the Docker daemon on the host machine, you will need to mount the /var/run/docker.sock unix socket.

# run latest stable Pumba docker image (from master repository)
$ docker run -d -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba:master pumba kill --interval 10s --signal SIGTERM ^hp

Pumba will not kill its own container.

Note: for Mac OS X, before you run Pumba you may want to do the following after downloading the pumba_darwin_amd64 binary:

chmod +x pumba_darwin_amd64
mv pumba_darwin_amd64 /usr/local/bin/pumba


The Pumba project is available for you to try out. We will gladly accept ideas, pull requests, issues, and contributions to the project.

Pumba GitHub Repository

07 Mar 2016, 16:39

Testing Strategies for Docker Containers

Congratulations! You know how to build a Docker image and are able to compose multiple containers into a meaningful application. Hopefully, you’ve already created a Continuous Delivery pipeline and know how to push your newly created image into production or testing environment.

Now, the question is - How do we test our Docker containers?

There are multiple testing strategies we can apply. In this post, I’ll highlight them presenting benefits and drawbacks for each.

The “Naive” approach

This is the default approach for most people. It relies on a CI server to do the job. When taking this approach, the developer uses Docker as a package manager, a better option than the jar/rpm/deb approach. The CI server compiles the application code and executes tests (unit, service, functional, and others). The build artifacts are then reused in a Docker build to produce a new image, which becomes the core deployment artifact. The produced image contains not only the application "binaries" but also the required runtime, including all dependencies and application configuration.

We gain application portability; however, we lose development and testing portability. We're not able to reproduce exactly the same development and testing environment outside the CI. To create a new test environment, we'd need to set up the testing tools (correct versions and plugins), configure runtime and OS settings, and get the same versions of test scripts and, perhaps, the test data.

The Naive Testing Strategy

Resolving these problems leads us to the next approach.

App & Test Container approach

Here, we try to create a single bundle with the application "binaries", including required packages, testing tools (specific versions), testing tool plugins, test scripts, and a test environment with all required packages.

The benefits of this approach:

  • We have a repeatable test environment - we can run exactly the same tests using the same testing tools - in our CI, development, staging, or production environment
  • We capture test scripts at a specific point in time so we can always reproduce them in any environment
  • We do not need to setup and configure our testing tools - they are part of our image

This approach has significant drawbacks:

  • Increases the image size - because it contains testing tools, required packages, test scripts, and perhaps even test data
  • Pollutes image runtime environment with test specific configuration and may even introduce an unneeded dependency (required by integration testing)
  • We also need to decide what to do with the test results and logs; how and where to export them

Here’s a simplified Dockerfile. It illustrates this approach.

FROM "<base image>":"<version>"

WORKDIR "<path>"

# install packages required to run app and tests:
# the app runtime with its dependencies, and the testing tools with theirs
RUN apt-get update && apt-get install -y \
    "<app runtime> and <dependencies>" \
    "<test tools> and <dependencies>" \
    && rm -rf /var/lib/apt/lists/*

# copy app files
COPY app app
COPY run.sh run.sh

# copy test scripts
COPY tests tests

# copy "main" test command
COPY test.sh test.sh

# ... EXPOSE, RUN, ADD ... for app and test environment

# main app command
CMD ["run.sh", "<app arguments>"]

# it's not possible to have multiple CMD commands, but this is the "main" test command
# CMD ["/test.sh", "<test arguments>"]
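Since only one CMD takes effect, a common workaround is to keep the application command as the default and override it when running tests. A hypothetical session (the image name is illustrative):

```
$ docker build -t app-with-tests .
$ docker run -d app-with-tests              # starts the app via run.sh (default CMD)
$ docker run --rm app-with-tests ./test.sh  # overrides CMD to run the test suite
```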


App & Test Container

There has to be a better way for in-container testing and there is.

Test Aware Container Approach

Today, Docker's promise is "Build -> Ship -> Run" - build the image, ship it to some registry, and run it anywhere. IMHO, there's a critical missing step - Test. The right and complete sequence should be: Build -> Test -> Ship -> Run.

Let’s look at a “test-friendly” Dockerfile syntax and extensions to Docker commands. This important step could be supported natively. It’s not a real syntax, but bear with me. I’ll define the “ideal” version and show how to implement something that’s very close.


Let’s define a special ONTEST instruction, similar to existing ONBUILD instruction. The ONTEST instruction adds a trigger instruction to the image to be executed at a later time when the image is tested. Any build instruction can be registered as a trigger.

The ONTEST instruction should be recognized by a new docker test command.

docker test [OPTIONS] IMAGE [COMMAND] [ARG...]

The docker test command syntax would be similar to that of the docker run command, with one significant difference: a new "testable" image is automatically generated and tagged as <image name>:<image tag>-test (a "test" postfix added to the original image tag). This "testable" image will be generated FROM the application image, executing all build instructions defined after the ONTEST instruction and then executing ONTEST CMD (or ONTEST ENTRYPOINT). The docker test command should return a non-zero exit code if any tests fail, and the test results should be written into an automatically generated VOLUME that points to the /var/tests/results folder.
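With such support, testing an image would be a one-liner. The session below is hypothetical, since docker test does not (yet) exist:

```
$ docker test -it --rm myapp:1.2
# builds myapp:1.2-test FROM myapp:1.2, runs the ONTEST CMD,
# writes test results to the auto-generated /var/tests/results volume,
# and returns a non-zero exit code if any test fails
```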

Let’s look at a modified Dockerfile below - it includes the new proposed ONTEST instruction.

FROM "<base image>":"<version>"

WORKDIR "<path>"

# install packages required to run app (app runtime and required packages)
RUN apt-get update && apt-get install -y \
    "<app runtime> and <dependencies>" \
    && rm -rf /var/lib/apt/lists/*

# install packages required to run tests (testing tools and required packages)
ONTEST RUN apt-get update && apt-get install -y \
           "<test tools> and <dependencies>" \
           && rm -rf /var/lib/apt/lists/*

# copy app files
COPY app app
COPY run.sh run.sh

# copy test scripts
ONTEST COPY tests tests

# copy "main" test command
ONTEST COPY test.sh test.sh

# auto-generated volume for test results
# ONTEST VOLUME "/var/tests/results"

# ... EXPOSE, RUN, ADD ... for app and test environment

# main app command
CMD ["run.sh", "<app arguments>"]

# main test command
ONTEST CMD ["/test.sh", "<test arguments>"]


Test Aware Container

Making “Test Aware Container” Real

We believe Docker should make docker-test part of the container management lifecycle. There is a need to have a simple working solution today and I’ll describe one that’s very close to the ideal state.

As mentioned before, Docker has a very useful ONBUILD instruction, which allows us to trigger another build instruction on succeeding builds. The basic idea is to use the ONBUILD instruction when running the docker-test command.

The flow executed by docker-test command:

  1. docker-test will search for ONBUILD instructions in the application Dockerfile and will …
  2. generate a temporary Dockerfile.test from the original Dockerfile
  3. execute docker build -f Dockerfile.test [OPTIONS] PATH, with all options supported by the docker build command; a -test suffix is automatically appended to the tag option
  4. if the build is successful, execute docker run -v ./tests/results:/var/tests/results [OPTIONS] IMAGE:TAG-test [COMMAND] [ARG...]
  5. remove the Dockerfile.test file

Why not create a new Dockerfile.test without requiring the ONBUILD instruction?

Because in order to test the right image (and tag), we would need to keep the FROM instruction always updated to the image:tag that we want to test. This is not trivial.

There is a limitation to the described approach: it's not suitable for "onbuild" images (images used to automatically build your app), like maven:onbuild.

Let’s look at a simple implementation of docker-test command. It highlights the concept: the docker-test command should be able to handle build and run command options and be able to handle errors properly.


# image and tag identify the application image under test
echo "FROM ${image}:${tag}" > Dockerfile.test &&
docker build -t "${image}:${tag}-test" -f Dockerfile.test . &&
docker run -it --rm -v $(pwd)/tests/results:/var/tests/results "${image}:${tag}-test" &&
rm Dockerfile.test

Let’s focus on the most interesting and relevant part.

Integration Test Container

Let’s say we have an application built from tens or hundreds of microservices. Let’s say we have an automated CI/CD pipeline, where each microservice is built and tested by our CI and deployed into some environment (testing, staging or production) after the build and tests pass. Pretty cool, eh? Our CI tests are capable of testing each microservice in isolation - running unit and service tests (or API contract tests). Maybe even micro-integration tests - tests run on subsystem are created in ad-hoc manner (for example with docker compose help).

This leads to some issues that we need to address:

  • What about real integration tests or long running tests (like performance and stress)?
  • What about resilience tests (“chaos monkey” like tests)?
  • Security scans?
  • What about test and scan activities that take time and should be run on a fully operational system?

There should be a better way than just dropping a new microservice version into production and tightly monitoring it for a while.

There should be a special Integration Test Container. These containers contain only testing tools and test artifacts: test scripts, test data, test environment configuration, etc. To simplify the orchestration and automation of such containers, we should define and follow some conventions and use metadata labels (the Dockerfile LABEL instruction).

Integration Test Labels

  • test.type - test type; default integration; can be one of: integration, performance, security, chaos or any text; presence of this label states that this is an Integration Test Container
  • test.results - VOLUME for test results; default /var/tests/results
  • test.XXX - any other test related metadata; just use test. prefix for label name
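Putting these label conventions together, the top of an Integration Test Container Dockerfile might look like this sketch (base image, testing tool, and paths are illustrative):

```dockerfile
FROM node:4

# metadata labels marking this image as an Integration Test Container
LABEL test.type="integration"
LABEL test.results="/var/tests/results"

# testing tool runtime, test scripts, and test data - no application code
RUN npm install -g mocha
COPY tests /tests
COPY test.sh /test.sh

VOLUME /var/tests/results

CMD ["/test.sh"]
```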

Integration Test Container

The Integration Test Container is just a regular Docker container. It does not contain any application logic or code; its sole purpose is to make testing repeatable and portable. Recommended content of an Integration Test Container:

  • The Testing Tool - Phantom.js, Selenium, Chakram, Gatling, …
  • Testing Tool Runtime - Node.js, JVM, Python, Ruby, …
  • Test Environment Configuration - environment variables, config files, bootstrap scripts, …
  • Tests - as compiled packages or script files
  • Test Data - any kind of data files, used by tests: json, csv, txt, xml, …
  • Test Startup Script - some “main” startup script to run tests; just create test.sh and launch the testing tool from it.
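The "main" startup script can stay trivial. Here is a sketch (the actual tool invocation is commented out, since it depends on your chosen testing tool, and the sketch writes to /tmp instead of the real /var/tests/results volume):

```shell
#!/bin/sh
set -e

# results folder; the real container would use the /var/tests/results volume
RESULTS="${RESULTS:-/tmp/tests/results}"
mkdir -p "$RESULTS"

echo "running integration tests..."
# launch the packaged testing tool here, e.g.:
# mocha tests --reporter xunit > "$RESULTS/report.xml"

# record an overall status for the orchestrator to pick up
echo "OK" > "$RESULTS/status"
```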

Integration Test Containers should run in an operational environment where all microservices are deployed: testing, staging, or production. These containers can be deployed exactly like all other services. They use the same network layer and thus can access multiple services, using the selected service discovery method (usually DNS). Accessing multiple services is required for real integration testing: we need to simulate and validate how our system works across multiple places. Keeping integration tests inside some application service container not only increases the container footprint but also creates unneeded dependencies between multiple services. We keep all these dependencies at the level of the Integration Test Container. Once our tests (and testing tools) are packaged inside the container, we can always rerun the same tests in any environment, including the developer machine. You can always go back in time and rerun a specific version of the Integration Test Container.

Integration Test Container

WDYT? Your feedback, particularly on standardizing the docker-test command, is greatly appreciated.

14 Jan 2016, 16:14

Docker Pattern: The Build Container

Let’s say that you’re developing a microservice in a compiled language or an interpreted language that requires some additional “build” steps to package and lint your application code. This is a useful docker pattern for the “build” container.

In our project, we're using Docker as our main deployment package: every microservice is delivered as a Docker image. Each microservice also has its own code repository (GitHub) and its own CI build job.

Some of our services are written in Java. Java code requires additional tooling and processing before you get working code. This tooling and all associated packages are build-time dependencies, and they are not required when the compiled program is running.

We have been trying to develop a repeatable process and uniform environment for deploying our services. We believe it’s good to package Java tools and packages into containers too. It allows us to build a Java-based microservice on any machine, including CI, without any specific environment requirements: JDK version, profiling and testing tools, OS, Maven, environment variables, etc.

For every service, we have two Dockerfiles: one for the service runtime and a second one packed with the tools required to build the service. We usually name these files Dockerfile and Dockerfile.build, and we use the -f, --file flag to specify the Dockerfile name for the docker build command.

Here is our Dockerfile.build file:

FROM maven:3.3.3-jdk-8

ENV GAIA_HOME=/usr/local/gaia/

RUN mkdir -p $GAIA_HOME
WORKDIR $GAIA_HOME

# speed up maven build, read https://keyholesoftware.com/2015/01/05/caching-for-maven-docker-builds/
# selectively add the POM file
ADD pom.xml $GAIA_HOME

# get all the downloads out of the way
RUN ["mvn","verify","clean","--fail-never"]

# add source
ADD src $GAIA_HOME/src

# run maven verify
RUN ["mvn","verify"]

As you can see, it’s a simple file with one little trick to speed up our Maven build.

Now, we have all the tools required to compile our service. We can run this Docker container on any machine without requiring to have JDK installed. We can run the same Docker container on a developer’s laptop and on our CI server.

This trick greatly simplifies our CI process - we no longer require our CI to support any specific compiler, version, or tool. All we need is the Docker engine. Everything else, we bring ourselves.

BYOT - Bring Your Own Tooling! :-)

In order to compile the service code, we need to build and run the builder container.

docker build -t build-img -f Dockerfile.build .
docker create --name build-cont build-img

Once we’ve built the image and created a new container from this image, we have our service compiled inside the container. The only remaining task is to extract build artifacts from the container. We could use Docker volumes - this is one possible option. Actually, we like that the image we’ve created, contains all the tools and build artifacts inside it. It allows us to get any build artifacts from this image at anytime, just by creating a new container from it.

To extract our build artifacts, we are using docker cp command. This command copies files from the container to the local file system.

Here is how we are using this command:

docker cp build-cont:/usr/local/gaia/target/mgs.war ./target/mgs.war

As a result, we have the compiled service code packaged into a single WAR file. We get exactly the same WAR file on any machine just by running our builder container, or by rebuilding the builder container against the same code commit (using a Git tag or a specific commit ID) on any machine.

We can now create a Docker image with our service and required runtime, which is usually some version of JRE and servlet container.

Here is our Dockerfile for the service. It’s an image with Jetty JRE8 and our service WAR file.

FROM jetty:9.3.0-jre8

RUN mkdir -p /home/jetty && chown jetty:jetty /home/jetty

COPY ./target/*.war $JETTY_BASE/webapps/

By running docker build ., we have a new image with our newly “built” service.

The Recipe:

  • Have one Dockerfile with all tools and packages required to build your service. Name it Dockerfile.build or give it a name you like.
  • Have another Dockerfile with all packages required to run your service.
  • Keep both files with your service code.
  • Build the builder image, create a container from it, and extract the build artifacts using volumes or the docker cp command.
  • Build the service image.
  • That’s all folks!


In our case of a Java-based service, the difference between the builder container and the service container is huge. The Java JDK is a much bigger package than the JRE; it also requires all X Window packages to be installed inside your container. For the runtime, you can have a slim image with your service code, a JRE, and some basic Linux packages, or even start from scratch.

30 Dec 2015, 11:09

Docker Pattern: Deploy and update dockerized application on cluster

Docker is a great technology that simplifies the development and deployment of distributed applications.

While Docker serves as a core technology, many issues remain to be solved. We find ourselves struggling with some of these issues. For example:

  • How to create an elastic Docker cluster?
  • How to deploy and connect multiple containers together?
  • How to build CI/CD process?
  • How to register and discover system services and more?

For most of these problems, there are open source projects or services available from the community, as well as commercial offerings, including from Docker, Inc.

In this post, I would like to address one such problem:

How to automate Continuous Delivery for a Docker cluster?

In our application, every service is packaged as a Docker image, and the Docker image is our basic deployment unit. In a runtime environment (production, testing, and others), we may have multiple containers created from this image. For each service, we have a separate Git repository and one or more Dockerfile(s). We use these for building, testing, and packaging - I will explain our project structure in the next blog post.

We’ve setup an automated Continuous Integration(CI) process. We use CircleCI. For every push to some branch for the service in a Git repository, the CI service triggers a new Docker build process and creates a new Docker image for the service. As part of the build process, if everything compiles and all unit and component tests pass, we push a newly created Docker image to our DockerHub repository.

At the end of CI process, we have a newly tested and ‘kosher’ Docker image in our DockerHub repository. Very nice indeed! However, we are left with several questions:

  • How to perform a rolling update of a modified service?
  • What is the target environment (i.e. some Docker cluster)?
  • How to find the IP of a dynamically created CoreOS host (we use AWS auto-scale groups)?
  • How to connect to it in a secure way without the need to expose the SSH keys or cloud credentials (we use AWS for infrastructure)?

Our application runs on a CoreOS cluster. We have automation scripts that create the CoreOS cluster in multiple environments: a developer machine, AWS, or some VM infrastructure. When we have a new service version (i.e. a new Docker image), we need to find a way to deploy this service to the cluster. CoreOS uses fleet for cluster orchestration. fleetctl is a command line client that talks to the fleet backend and allows you to manage the CoreOS cluster. However, it only works in a local environment (machine or network). For some commands fleetctl uses an HTTP API, and for others an SSH connection. The HTTP API is not protected, so it makes no sense to expose it from your cloud environment. The SSH connection does not work when your architecture assumes there is no direct SSH connection to the cluster: connecting through an SSH bastion machine, which requires multiple SSH keys and SSH configuration files, is not supported by the fleetctl program. And I have no desire to store SSH keys or my cloud credentials for my production environment on any CI server, due to security concerns.

So, what do we do?

First, we want to deploy a new service or update some service when there is a new Docker image or image version available. We also want to be selective and pick images created from code in a specific branch or even specific build.

The basic idea is to create a Docker image that captures the system configuration. This image stores the configuration of all services at a specific point in time and from specific branches. The container created from this image should be run from within the target cluster. Besides captured configuration, we also have the deployment tool (fleetctl in our case), plus some code that detects services which need to be updated, deleted, or installed as a new service.

This idea leads to another question:

How do you define and capture system configuration?

In CoreOS, every service can be described in a systemd unit file. This is a plain text file that describes how and when to launch your service. I'm not going to explain how to write such files; there is a lot of info online. What's important in our case is that the service systemd unit file contains a docker run command with parameters and the image name that needs to be downloaded and executed. We keep the service systemd unit file in the same repository as the service code.
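For reference, a minimal systemd unit file for such a dockerized service might look like this sketch (service and image names are illustrative):

```ini
[Unit]
Description=My Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker rm -f myservice
ExecStartPre=/usr/bin/docker pull myrepo/myservice
ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice
ExecStop=/usr/bin/docker stop myservice
```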

The image name is usually defined as a repository/image_name:tag string. The tag is the most important thing in our solution. As explained above, our CI server automatically builds a new Docker image on every push of the service to the GitHub repository. The CI job also tags the newly created image with two tags:

  1. branch tag - taken from the GitHub branch that triggered the build (master, develop, and feature-* branches in our case)
  2. build_num-branch tag - with a running build number added as a prefix before the branch name

As a result, in DockerHub we have images for the latest build in every branch, and for every image we can identify the build job number and the branch it was created from.

As I said before, we keep the service systemd unit file in the same repository as the code. This file does not contain an image tag, only the repository and image name. See the example below:

ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice

Our CI build job generates a new service systemd unit file for every successful build, replacing the above service invocation command with one that also contains the new tag, following the build_num-branch pattern (the develop branch in our example). We use two simple utilities for this job: cat and sed. It's possible to use a more advanced template engine.

ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice:58-develop
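The cat/sed step can be sketched like this (file and variable names are illustrative; BUILD_NUM and BRANCH would come from the CI environment):

```shell
#!/bin/sh
set -e
cd /tmp

# values normally provided by the CI build environment
BUILD_NUM=58
BRANCH=develop
TAG="${BUILD_NUM}-${BRANCH}"

# template unit file with the untagged image name
cat > myservice.service.tmpl <<'EOF'
ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice
EOF

# pin this exact build by appending the tag to the image name
sed "s|myrepo/myservice|myrepo/myservice:${TAG}|" myservice.service.tmpl > myservice.service

cat myservice.service
# prints: ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice:58-develop
```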

As a last step, the CI build job “deploys” this new unit file to the system configuration Git repository.

git commit -am "Updating myservice tag to '58-develop'" myservice.service
git push origin develop

Another CI job that monitors changes in the system configuration repository will trigger a build and create a new Docker image that captures updated system configuration.

All we need to do now is to execute this Docker image on the Docker cluster. Something like this:

docker run -it --rm=true myrepo/mysysconfig:develop

Our system configuration Dockerfile:

FROM alpine:3.2
MAINTAINER Alexei Ledenev <alexei.led@gmail.com>


ENV gaia /home/gaia
RUN mkdir -p ${gaia}

WORKDIR ${gaia}
COPY *.service ./
COPY deploy.sh ./
RUN chmod +x deploy.sh

# install packages and fleetctl client
# fleet release to install; adjust to match your cluster
ENV FLEET_VERSION 0.11.5
RUN apk update && \
 apk add curl openssl bash && \
 rm -rf /var/cache/apk/* && \
 curl -L "https://github.com/coreos/fleet/releases/download/v${FLEET_VERSION}/fleet-v${FLEET_VERSION}-linux-amd64.tar.gz" | tar xz && \
 mv fleet-v${FLEET_VERSION}-linux-amd64/fleetctl /usr/local/bin/ && \
 rm -rf fleet-v${FLEET_VERSION}-linux-amd64

CMD ["/bin/bash", "deploy.sh"]

deploy.sh is a shell script that, for every service, checks whether it needs to be updated, created, or deleted, and executes the corresponding fleetctl command.
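deploy.sh itself is not shown in this post; a dry-run sketch of its shape, printing the fleetctl commands instead of executing them (with a sample captured unit file created for illustration), could be:

```shell
#!/bin/sh
set -e
# work in a scratch folder with one sample captured unit file
mkdir -p /tmp/sysconfig && cd /tmp/sysconfig
printf '[Service]\n' > myservice.service

# a real deploy.sh would diff the captured units against the running ones
# and only destroy/start the units that actually changed
for unit in *.service; do
  echo "fleetctl destroy $unit"
  echo "fleetctl start $unit"
done | tee plan.txt
```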

The final step:

How do you run this “system configuration” container?

Currently, in our environment, we do this manually (from an SSH shell) for development clusters, and use systemd timers for CoreOS clusters on AWS. A systemd timer allows us to define a cron-like job at the CoreOS cluster level.
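Such a timer is a pair of unit files; a sketch (unit names and schedule are illustrative):

```ini
# sysconfig-deploy.service - the one-shot deployment job
[Unit]
Description=Run the system configuration container

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm myrepo/mysysconfig:develop

# sysconfig-deploy.timer - fires the job every 15 minutes
[Unit]
Description=Periodic system configuration deploy

[Timer]
OnCalendar=*:0/15

[Install]
WantedBy=timers.target
```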

We have plans to define a WebHook endpoint that will allow us to trigger deployment/update based on a WebHook event from the CI service.

Hope you find this post useful. I look forward to your comments and any questions you have.



Alexei Ledenev

Software Developer