20 Jul 2019, 10:00

EKS GPU Cluster from Zero to Hero


If you have ever tried to run a GPU workload on a Kubernetes cluster, you know that this task requires non-trivial configuration and comes with a high cost (GPU instances are quite expensive).

This post shows how to run a GPU workload on a Kubernetes cluster in a cost-effective way, using an AWS EKS cluster, AWS Auto Scaling, Amazon EC2 Spot Instances, and some Kubernetes resources and configurations.

Kubernetes with GPU Mixed ASG

EKS Cluster Plan

First, we need to create a Kubernetes cluster that consists of mixed nodes: non-GPU nodes for management and generic Kubernetes workloads, and more expensive GPU nodes to run GPU-intensive tasks, like machine learning, medical analysis, seismic exploration, video transcoding, and others.

These node groups should be able to scale on demand (scale out and scale in) for generic nodes, and from 0 to the required number and back to 0 for expensive GPU instances. Moreover, in order to do this in a cost-effective way, we are going to use Amazon EC2 Spot Instances for both the generic nodes and the GPU nodes.

AWS EC2 Spot Instances

With Amazon EC2 Spot Instances you can save up to 90% compared to the On-Demand price. Previously, Spot instances were terminated in ascending order of bids, so market prices fluctuated frequently. In the current model, Spot prices are more predictable, updated less frequently, and determined by Amazon EC2 spare capacity, not bid prices. The EC2 service can reclaim Spot instances when there is not enough capacity for a specific instance type in a specific Availability Zone. Spot instances receive a 2-minute alert when they are about to be reclaimed by the Amazon EC2 service and can use this time for graceful shutdown and state change.
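The 2-minute interruption notice is published through the EC2 instance metadata service. A minimal polling sketch (it only works on a running EC2 instance; the endpoint returns HTTP 404 until a notice is issued, and the shutdown steps are placeholders):

```
# Sketch: poll the instance metadata service for a Spot interruption notice
while true; do
  if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action; then
    echo "Spot interruption notice received - starting graceful shutdown"
    # drain the Kubernetes node, flush state, etc.
    break
  fi
  sleep 5
done
```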

The Workflow

Create EKS Cluster

It is possible to create an AWS EKS cluster using the AWS CLI, CloudFormation, Terraform, the AWS CDK, or eksctl.

eksctl CLI tool

In this post eksctl (a CLI tool for creating clusters on EKS) is used. It is possible to pass all parameters to the tool as CLI flags or in a configuration file. Using a configuration file makes the process more repeatable and automation friendly.

eksctl can create or update EKS cluster and additional required AWS resources, using CloudFormation stacks.

Customize your cluster by using a config file. Just run

eksctl create cluster -f cluster.yaml

to apply a cluster.yaml file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

nodeGroups:
  - name: ng
    instanceType: m5.large
    desiredCapacity: 10

A new EKS cluster with 10 m5.large On-Demand EC2 worker nodes will be created, and the cluster credentials will be added to the ~/.kube/config file.

Creating node groups

As planned, we are going to create two node groups for Kubernetes worker nodes:

  1. General node group - an autoscaling group with Spot instances to run the Kubernetes system workload and non-GPU workloads
  2. GPU node group - an autoscaling group with GPU-powered Spot Instances that can scale from 0 to the required number of instances and back to 0

Fortunately, eksctl supports adding Kubernetes node groups to an EKS cluster, and these groups can be composed of Spot-only instances or a mixture of Spot and On-Demand instances.

General node group

The eksctl configuration file below defines an EKS cluster in us-west-2 across 3 Availability Zones, and the first General autoscaling node group (from 2 to 20 nodes) running on diversified Spot instances.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gaia-kube
  region: us-west-2

availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

nodeGroups:
  # spot workers NG - multi AZ, scale from 2
  - name: spot-ng
    ami: auto
    instanceType: mixed
    desiredCapacity: 2
    minSize: 2
    maxSize: 20
    volumeSize: 100
    volumeType: gp2
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      withAddonPolicies:
        autoScaler: true
        ebs: true
        albIngress: true
        cloudWatch: true
    instancesDistribution:
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - m4.2xlarge
        - m4.4xlarge
        - m5.2xlarge
        - m5.4xlarge
        - m5a.2xlarge
        - m5a.4xlarge
        - c4.2xlarge
        - c4.4xlarge
        - c5.2xlarge
        - c5.4xlarge
      spotInstancePools: 15
    tags:
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
    privateNetworking: true
    availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

    ### next: GPU node groups ...

Now it is time to explain some parameters used in the above configuration file.

  • ami: auto - eksctl automatically discovers the latest EKS-optimized AMI for worker nodes, based on the specified AWS region, EKS version, and instance type. See the Amazon EKS-Optimized AMI chapter in the User Guide
  • instanceType: mixed - specifies that the actual instance type will be one of the instance types defined in the instancesDistribution section
  • iam - contains a list of predefined and in-place IAM policies; eksctl creates a new IAM Role with the specified policies and attaches this role to every EKS worker node. There are several IAM policies you are required to attach to every EKS worker node; read the Amazon EKS Worker Node IAM Role section in the User Guide and the eksctl IAM policies documentation
  • instancesDistribution - specifies the mixed instance policy for EC2 Auto Scaling Groups; read the AWS MixedInstancesPolicy documentation
  • spotInstancePools - specifies the number of Spot instance pools to use, read more
  • tags - AWS tags added to EKS worker nodes
    • k8s.io/cluster-autoscaler/enabled - the Kubernetes Cluster Autoscaler uses this tag for auto-discovery
  • privateNetworking: true - all EKS worker nodes will be placed into private subnets
Spot Instance Pools

When you use Spot instances as worker nodes, you need to diversify usage across as many Spot Instance pools as possible. A Spot Instance pool is a set of unused EC2 instances with the same instance type (for example, m5.large), operating system, Availability Zone, and network platform.

eksctl currently supports a single Spot provisioning model: the lowestPrice allocation strategy. This strategy allows creation of a fleet of Spot Instances that is both cheap and diversified. Spot Fleet automatically deploys the cheapest combination of instance types and Availability Zones based on the current Spot price, across the number of Spot pools that you specify. This combination helps to avoid the most expensive Spot Instances.

Spot instance diversification also increases worker node availability: typically, not all Spot Instance pools are interrupted at the same time, so only a small portion of your workload is interrupted, and the EC2 Auto Scaling group replaces interrupted instances with instances from other Spot Instance pools.

GPU-powered node group

The next part of our eksctl configuration file defines the first GPU autoscaling node group (from 0 to 10 nodes), running on diversified GPU-powered Spot instances.

When using GPU-powered Spot instances, it’s recommended to create a GPU node group per Availability Zone and configure the Kubernetes Cluster Autoscaler to avoid automatic ASG rebalancing.

Why is this important? GPU-powered EC2 Spot Instances have a relatively high frequency of interruption (>20% for some GPU instance types); using multiple AZs and disabling automatic Cluster Autoscaler balancing helps to minimize GPU workload interruptions.

  # ... EKS cluster and General node group ...

  # spot GPU NG - us-west-2a AZ, scale from 0
  - name: gpu-spot-ng-a
    ami: auto
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 10
    volumeSize: 100
    volumeType: gp2
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      withAddonPolicies:
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
    instancesDistribution:
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p3.16xlarge
        - p2.xlarge
        - p2.8xlarge
        - p2.16xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: 'true'
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-west-2a"]

    # create additional node groups for the `us-west-2b` and `us-west-2c` availability zones ...

Now it is time to explain some of the parameters used to configure the GPU-powered node group.

  • ami: auto - eksctl automatically discovers the latest EKS-optimized AMI with GPU support for worker nodes, based on the specified AWS region, EKS version, and instance type. See the Amazon EKS-Optimized AMI with GPU support section in the User Guide
  • iam: withAddonPolicies - if the planned workload requires access to AWS storage services, it is important to include the additional IAM policies (auto-generated by eksctl)
  • tags - AWS tags added to EKS worker nodes
    • k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true - Kubernetes node taint
    • k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true' - Kubernetes node label, used by the Cluster Autoscaler to scale the ASG from/to 0
  • taints
    • nvidia.com/gpu: "true:NoSchedule" - Kubernetes GPU node taint; helps to avoid placing non-GPU workloads on expensive GPU nodes
EKS Optimized AMI image with GPU support

In addition to the standard Amazon EKS-optimized AMI configuration, the GPU AMI includes the following:

  • NVIDIA drivers
  • The nvidia-docker2 package
  • The nvidia-container-runtime (as the default runtime)
Scaling a node group to/from 0

Starting from Kubernetes Cluster Autoscaler 0.6.1, it is possible to scale a node group to/from 0, assuming that all scale-up and scale-down conditions are met.

If you are using nodeSelector, you need to tag the ASG with a node-template key k8s.io/cluster-autoscaler/node-template/label/, and with k8s.io/cluster-autoscaler/node-template/taint/ if you are using taints.
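For reference, a sketch of matching Cluster Autoscaler container flags that use the ASG tags defined above (the exact flag set depends on your Cluster Autoscaler version; check its documentation):

```
# Cluster Autoscaler flags (sketch): auto-discover ASGs by tag and keep
# per-AZ GPU node groups independent (no balancing between similar groups)
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
  - --balance-similar-node-groups=false
  - --skip-nodes-with-system-pods=false
```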

Scheduling GPU workload

Schedule based on GPU resources

The NVIDIA device plugin for Kubernetes exposes the number of GPUs on each node of your cluster. Once the plugin is installed, it’s possible to use the nvidia.com/gpu Kubernetes resource on GPU nodes and in Kubernetes workloads.

Run this command to apply the NVIDIA Kubernetes device plugin as a daemonset running only on AWS GPU-powered worker nodes, using tolerations and nodeAffinity:

kubectl create -f kubernetes/nvidia-device-plugin.yaml

kubectl get daemonset -nkube-system

NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
aws-node                              5         5         5       5            5           <none>          8d
kube-proxy                            5         5         5       5            5           <none>          8d
nvidia-device-plugin-daemonset-1.12   3         3         3       3            3           <none>          8d
ssm-agent                             5         5         5       5            5           <none>          8d

using the nvidia-device-plugin.yaml Kubernetes resource file:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-1.12
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/instance-type
                operator: In
                values:
                - p3.2xlarge
                - p3.8xlarge
                - p3.16xlarge
                - p3dn.24xlarge
                - p2.xlarge
                - p2.8xlarge
                - p2.16xlarge
Taints and Tolerations

Kubernetes taints allow a node to repel a set of pods. Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.

See Kubernetes Taints and Tolerations documentation for more details.

In order for a GPU workload to run on GPU-powered Spot instance nodes, with the nvidia.com/gpu: "true:NoSchedule" taint, the workload must include both matching tolerations and a nodeSelector.

A Kubernetes deployment with 10 pod replicas and an nvidia.com/gpu: 1 resource limit:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  labels:
    app: cuda-vector-add
spec:
  replicas: 10
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      name: cuda-vector-add
      labels:
        app: cuda-vector-add
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
        - name: cuda-vector-add
          # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU

Deploy cuda-vector-add deployment and see how new GPU-powered nodes are added to the EKS cluster.

# list Kubernetes nodes before running GPU workload
kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type"
NAME                                            ID                                      TYPE
ip-192-168-151-104.us-west-2.compute.internal   aws:///us-west-2b/i-01d4c83eaee18b7b3   c4.4xlarge
ip-192-168-171-140.us-west-2.compute.internal   aws:///us-west-2c/i-07ec09fd128e1393f   c4.4xlarge

# deploy GPU workload on EKS cluster with tolerations for nvidia/gpu=true
kubectl create -f kubernetes/examples/vector/vector-add-dpl.yaml

# list Kubernetes nodes after several minutes to see new GPU nodes added to the cluster
kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type"

NAME                                            ID                                      TYPE
ip-192-168-101-60.us-west-2.compute.internal    aws:///us-west-2a/i-037d1994fe96eeffc   p2.16xlarge
ip-192-168-139-227.us-west-2.compute.internal   aws:///us-west-2b/i-0038eb8d2c795fb40   p2.16xlarge
ip-192-168-151-104.us-west-2.compute.internal   aws:///us-west-2b/i-01d4c83eaee18b7b3   c4.4xlarge
ip-192-168-171-140.us-west-2.compute.internal   aws:///us-west-2c/i-07ec09fd128e1393f   c4.4xlarge
ip-192-168-179-248.us-west-2.compute.internal   aws:///us-west-2c/i-0bc0853ef26c0c054   p2.16xlarge

As you can see, 3 new GPU-powered nodes (p2.16xlarge), across 3 AZs, have been added to the cluster. When you delete the GPU workload, the cluster will scale the GPU node group back down to 0 after 10 minutes.


Follow this tutorial to create an EKS (Kubernetes) cluster with a GPU-powered node group, running on Spot instances and scalable from/to 0 nodes.



It does not matter where I work, all my opinions are my own.

08 Mar 2019, 10:00

Kubernetes Continuous Integration

Kubernetes configuration as Code

A complex Kubernetes application consists of multiple Kubernetes resources, defined in YAML files. Authoring properly formatted YAML files that are also valid Kubernetes specifications, and that comply with your policies, can be a challenging task.

These YAML files are your application deployment and configuration code, and should be treated as code.

As with code, a Continuous Integration approach should be applied to Kubernetes configuration files.

Git Repository

Create a separate Git repository that contains your Kubernetes configuration files. Define a Continuous Integration pipeline that is triggered automatically for every change and can validate it without human intervention.


Helm

Helm helps you manage complex Kubernetes applications. Using Helm Charts, you define, install, test, and upgrade a Kubernetes application.

Here I’m going to focus on using Helm for authoring complex Kubernetes application.

The same Kubernetes application can be installed in multiple environments: development, testing, staging, and production. Helm templates help to separate application structure from environment configuration by keeping environment-specific values in external files.

Dependency Management

Helm also helps with dependency management. A typical Kubernetes application can be composed of multiple services developed by other teams and the open source community.

A requirements.yaml file is a YAML file in which developers can declare Helm chart dependencies, along with the location of the chart and the desired version. For example, this requirements file declares two dependencies:
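For instance, a requirements.yaml declaring two hypothetical dependencies might look like this (the chart names, versions, and repository URL are illustrative):

```
dependencies:
  - name: postgresql
    version: ~3.1.0
    repository: https://kubernetes-charts.storage.googleapis.com
  - name: redis
    version: ~4.2.0
    repository: https://kubernetes-charts.storage.googleapis.com
```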

Where possible, use version ranges instead of pinning to an exact version. The suggested default is to use a patch-level version match:

version: ~1.2.3

This will match version 1.2.3 and any patches to that release. In other words, ~1.2.3 is equivalent to >= 1.2.3, < 1.3.0


YAML is the most convenient way to write Kubernetes configuration files. YAML is easier for humans to read and write than other common data formats like XML or JSON. Still, it’s recommended to use automatic YAML linters to avoid syntax and formatting errors.


Helm Chart Validation

Helm has a helm lint command that runs a series of tests to verify that the chart is well-formed. helm lint also converts YAML to JSON, and this way it is able to detect some YAML errors.

helm lint mychart

==> Linting mychart
[ERROR] templates/deployment.yaml: unable to parse YAML
    error converting YAML to JSON: yaml: line 53: did not find expected '-' indicator

There are a few issues with helm lint you should be aware of:

  • no real YAML validation is done: only YAML-to-JSON conversion errors are detected
  • it shows the wrong error line number: not the actual line in the template that contains the detected error

So, I recommend also using a YAML linter, like yamllint, to perform YAML validation.

First, you need to generate Kubernetes YAML files from the Helm chart. The helm template command renders chart templates locally and prints the output to stdout.

helm template --namespace test --values dev-values.yaml mychart

Pipe helm template and yamllint together to validate rendered YAML files.

helm template mychart | yamllint -

  41:81     error    line too long (93 > 80 characters)  (line-length)
  43:1      error    trailing spaces  (trailing-spaces)
  151:9     error    wrong indentation: expected 10 but found 8  (indentation)
  245:10    error    syntax error: expected <block end>, but found '<block sequence start>'
  293:1     error    too many blank lines (1 > 0)  (empty-lines)

Now there are multiple ways to inspect these errors:

# using vim editor
vim <(helm template mychart)

# using cat command (with line number)
cat -n <(helm template mychart)

# printing error line and few lines around it, replacing spaces with dots
helm template hermes | sed -n 240,250p | tr ' ' '⋅'

Valid Kubernetes Configuration

When authoring Kubernetes configuration files, it’s important not only check if they are valid YAML files, but if they are valid Kubernetes files.

It turns out that the Kubernetes supports OpenAPI specification and it’s possible to generate Kubernetes JSON schema for every Kubernetes API version.

Gareth Rushgrove wrote a blog post on this topic and maintains the garethr/kubernetes-json-schema GitHub repository with most recent Kubernetes and OpenShift JSON schemas for all API versions. What a great contribution to the community!

Now, with Kubernetes API JSON schema, it’s possible to validate any YAML file if it’s a valid Kubernetes configuration file.

The kubeval tool (also authored by Gareth Rushgrove) comes to help.

Run kubeval piped with helm template command.

helm template mychart | kubeval

The output below shows a single detected error in a Service definition: an invalid annotation. kubeval could be more specific by providing the Service name, but even as is, this is valuable output for detecting Kubernetes configuration errors.

The document stdin contains a valid Secret
The document stdin contains a valid Secret
The document stdin contains a valid ConfigMap
The document stdin contains a valid ConfigMap
The document stdin contains a valid PersistentVolumeClaim
The document stdin contains an invalid Service
---> metadata.annotations: Invalid type. Expected: object, given: null
The document stdin contains a valid Service
The document stdin contains a valid Deployment
The document stdin contains a valid Deployment
The document stdin is empty
The document stdin is empty
The document stdin is empty
The document stdin is empty


26 Nov 2016, 16:00

Do not ignore .dockerignore


Tip: Consider defining and using a .dockerignore file for every Docker image you are building. It can help you reduce Docker image size, speed up docker build, and avoid unintended secret exposure.

Overloaded container ship

Docker build context

The docker build command is used to build a new Docker image. There is one argument you pass to the build command: the build context.

So, what is the Docker build context?

First, remember that Docker is a client-server application; it consists of the Docker client and the Docker server (also known as the daemon). The Docker client command-line tool talks to the Docker server and asks it to do things. One of these things is build: building a new Docker image. The Docker server can run on the same machine as the client, on a remote machine, or in a virtual machine, which itself can be local, remote, or even run on some cloud IaaS.

Why is that important and how is the Docker build context related to this fact?

In order to create a new Docker image, the Docker server needs access to the files you want to create the Docker image from. So, you need somehow to send these files to the Docker server. These files are the Docker build context. The Docker client packs all build context files into a tar archive and uploads this archive to the Docker server. By default, the client takes all files (and folders) in the current working directory and uses them as the build context. It can also accept an already created tar archive or a git repository. In the case of a git repository, the client clones it with its submodules into a temporary folder and creates a build context archive from it.

Impact on Docker build

The first output line that you see when running the docker build command is:

Sending build context to Docker daemon 45.3 MB
Step 1: FROM ...

This should make things clear. Every time you run the docker build command, the Docker client creates a new build context archive and sends it to the Docker server. So, you are always paying this “tax”: the time it takes to create the archive, plus the storage, network traffic, and latency it adds.
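To see roughly how much you are paying, you can estimate the build context size before running docker build. A quick sketch (it archives the whole working directory and does not honor .dockerignore, so it is an upper bound):

```shell
# approximate size (in bytes) of the archive the Docker client would upload
tar -cz . | wc -c
```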

Tip: The rule of thumb is: do not add files to the build context if you do not need them in your Docker image.

The .dockerignore file

The .dockerignore file is the tool that can help you define the Docker build context you really need. Using this file, you can specify ignore rules, and exceptions from these rules, for files and folders that won’t be included in the build context and thus won’t be packed into the archive and uploaded to the Docker server.

Why should you care?

Indeed, why should you care? Computers today are fast, networks are also pretty fast (hopefully), and storage is cheap. So this “tax” may not be that big, right? I will try to convince you that you should care.

Reason #1: Docker image size

The world of software development is shifting lately towards continuous delivery, elastic infrastructure and microservice architecture.

How is that related?

Your system is composed of multiple components (or microservices), each one of them running inside a Linux container. There might be tens or hundreds of services and even more service instances. These service instances can be built and deployed independently of each other, and this can be done for every single code commit. More than that, elastic infrastructure means that compute nodes can be added to or removed from the system, and its microservices can move from node to node to support scale or availability requirements. That means your Docker images will be frequently built and transferred.

When you practice continuous delivery and microservice architecture, image size and image build time do matter.

Reason #2: Unintended secrets exposure

Not controlling your build context can also lead to unintended exposure of your code, commit history, and secrets (keys and credentials).

If you copy files into your Docker image with the ADD . or COPY . command, you may unintentionally include your source files, the whole git history (the .git folder), secret files (like .aws, .env, private keys), caches, and other files, not only in the Docker build context, but also in the final Docker image.

There are multiple Docker images currently available on DockerHub, that expose application source code, passwords, keys and credentials (for example Twitter Vine).
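To reduce that risk, here is a minimal .dockerignore sketch that keeps common secret and VCS files out of the build context (the exact list depends on your project layout):

```
.git
.env
.aws
**/*.pem
**/id_rsa
```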

Reason #3: The Docker build - cache invalidation

A common pattern is to inject an application’s entire codebase into an image using an instruction like this:

COPY . /usr/src/app

In this case, we’re copying the entire build context into the image. It’s also important to understand that every Dockerfile command generates a new layer. So, if any included file anywhere in the build context changes, this change will invalidate the build cache for the COPY . /usr/src/app layer, and a new image layer will be generated on the next build.

If your working directory contains files that are frequently updated (logs, test results, git history, temporary cache files and similar), you are going to regenerate this layer for every docker build run.
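One common way to soften this cache invalidation, sketched here for a hypothetical Node.js application (the base image and paths are assumptions), is to copy the dependency manifest and install dependencies before copying the frequently changing source code:

```
FROM node:6
WORKDIR /usr/src/app

# the dependency manifest changes rarely: this layer (and npm install) stays cached
COPY package.json .
RUN npm install

# application source changes often: only this layer is rebuilt on code changes
COPY . /usr/src/app
```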

The .dockerignore syntax

The .dockerignore file is similar to the .gitignore file used by the git tool. Like .gitignore, it allows you to specify a pattern for files and folders that should be ignored by the Docker client when generating the build context. While the .dockerignore pattern syntax is similar to .gitignore, it’s not the same.

The .dockerignore pattern matching syntax is based on Go filepath.Match() function and includes some additions.

Here is the complete syntax for the .dockerignore:

pattern:
    { term }
term:
    '*'         matches any sequence of non-Separator characters
    '?'         matches any single non-Separator character
    '[' [ '^' ] { character-range } ']'
                character class (must be non-empty)
    c           matches character c (c != '*', '?', '\\', '[')
    '\\' c      matches character c

character-range:
    c           matches character c (c != '\\', '-', ']')
    '\\' c      matches character c
    lo '-' hi   matches character c for lo <= c <= hi

additions:
    '**'        matches any number of directories (including zero)
    '!'         lines starting with ! (exclamation mark) can be used to make exceptions to exclusions
    '#'         lines starting with this character are ignored: use it for comments

Note: Using the ! character is pretty tricky. The combination of it and patterns before and after line with the ! character can be used to create more advanced rules.


# ignore .git and .cache folders
.git
.cache

# ignore all *.class files in all folders, including build root
**/*.class

# ignore all markdown files (md) beside all README*.md other than README-secret.md
*.md
!README*.md
README-secret.md


Hope you find this post useful. I look forward to your comments and any questions you have.

06 Oct 2016, 16:00

Docker Swarm cluster with docker-in-docker on MacOS


Docker-in-Docker (dind) can help you run a Docker Swarm cluster on your Macbook with only Docker for Mac (v1.12+). No VirtualBox, docker-machine, Vagrant, or other apps are required.

The Beginning

One day, I decided to try running a Docker 1.12 Swarm cluster on my MacBook Pro. The Docker team did a great job releasing Docker for Mac, and since then I have forgotten all the problems I used to have with boot2docker. I really like Docker for Mac: it’s fast, lightweight, tightly integrated with MacOS, and significantly simplifies my life when working in a changing network environment. The only missing thing is that it’s only possible to create and work with a single Docker daemon running inside the xhyve VM. Shit! I want a cluster.

Of course, it’s possible to create a Swarm cluster with the docker-machine tool, but it’s not MacOS friendly and requires installing additional VM software, like VirtualBox or Parallels (why? I already have xhyve!). I have different networks at the work office and at home. At work I’m behind a corporate proxy with multiple firewall filters. At home, of course, life is better. docker-machine requires creating dedicated VMs for each environment, and thus forces me to juggle different shell scripts when I switch from one to the other. It’s possible, but it’s not fun.

I just want to have multi-node Swarm cluster with Docker for Mac (and xhyve VM). As simple as it is.

I’m a lazy person, and if there is an already existing solution, I will always choose it, even if it’s not ideal. So, after googling for a while, I failed to find any suitable solution or blog post. So, I decided to create my own and share it with you.

The Idea

The basic idea is to use Docker for Mac for running the Swarm master, and several Docker-in-Docker containers for running Swarm worker nodes.

First, let’s init our Swarm master:

# init Swarm master
docker swarm init

… keep Swarm join token:

# get join token
SWARM_TOKEN=$(docker swarm join-token -q worker)

… and Docker xhyve VM IP:

# get Swarm master IP (Docker for Mac xhyve VM IP)
SWARM_MASTER=$(docker info | grep -w 'Node Address' | awk '{print $3}')

… now let’s create 3 worker nodes and join them to our cluster:

# run NUM_WORKERS workers with SWARM_TOKEN
NUM_WORKERS=3
for i in $(seq "${NUM_WORKERS}"); do
  docker run -d --privileged --name worker-${i} --hostname=worker-${i} -p ${i}2375:2375 docker:1.12.1-dind
  docker --host=localhost:${i}2375 swarm join --token ${SWARM_TOKEN} ${SWARM_MASTER}:2377
done

Listing all our Swarm cluster nodes:

# list Swarm nodes :)
docker node ls

… you should see something like this:

ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
1f6z8pioh3vuaz84gyp0biqt0    worker-2  Ready   Active
35z72o6zjhs9u1h99lrwzvx5n    worker-3  Ready   Active
d9ph5cmc604wp1vhhs754nnxx *  moby      Ready   Active        Leader
dj3gnpv86uqrw4b9mo9ux4jb5    worker-1  Ready   Active

That’s all, folks! Now you have a running Swarm cluster on your Macbook, and your Docker client is talking to the Swarm master.

Nice tip:

You can use the very nice Swarm visualizer by Mano Marks to see your Swarm cluster “in action”.

Run it with following command:

docker run -it -d -p 8000:8000 -e HOST=localhost -e PORT=8000 -v /var/run/docker.sock:/var/run/docker.sock manomarks/visualizer

And you should be able to see something like this (after you deploy some demo app):

Docker Swarm visualizer: Voting App

01 Aug 2016, 20:00

Network emulation for Docker containers


Pumba netem delay and netem loss commands can emulate network delay and packet loss between Docker containers, even on single host. Give it a try!


Microservice architecture has been adopted by software teams as a way to deliver business value faster. Container technology enables delivery of microservices into any environment. Docker has accelerated this by providing an easy to use toolset for development teams to build, ship, and run distributed applications. These applications can be composed of hundreds of microservices packaged in Docker containers.

In a recent NGINX survey [Finding #7], the “biggest challenge holding back developers” is the trade-off between quality and speed. As Martin Fowler indicates, testing strategies in microservices architecture can be very complex. Creating a realistic and useful testing environment is an aspect of this complexity.

One challenge is simulating network failures to ensure resiliency of applications and services.

The network is a critical arterial system for ensuring reliability for any distributed application. Network conditions are different depending on where the application is accessed. Network behavior can greatly impact the overall application availability, stability, performance, and user experience (UX). It’s critical to simulate and understand these impacts before the user notices. Testing for these conditions requires conducting realistic network tests.

After Docker containers are deployed in a cluster, all communication between containers happens over the network. The containers may run on a single host, on different hosts, in different networks, or even in different datacenters.

How can we test for the impact of network behavior on the application? What can we do to emulate different network properties between containers on a single host or among clusters on multiple hosts?

Pumba with Network Emulation

Pumba is a chaos testing tool for Docker containers, inspired by Netflix Chaos Monkey. The main benefit is that it works with containers instead of VMs. Pumba can kill, stop, or restart running Docker containers, or pause processes within specified containers. We use it for resilience testing of our distributed applications. Resilience testing ensures reliability of the system. It allows the team to verify that their application recovers correctly from any event (expected or unexpected) without any loss of data or functionality. Pumba simulates these events for distributed and containerized applications.

Pumba netem

We enhanced Pumba with network emulation capabilities, starting with delay and packet loss. Using the pumba netem command, we can apply delay or packet loss to any Docker container. Under the hood, Pumba uses Linux kernel traffic control (tc) with the netem queueing discipline. For this to work, the iproute2 package must be available in the Docker images we want to test. Some base Docker images already include it.

Pumba netem delay and netem loss commands can emulate network delay and packet loss between Docker containers, even on a single host.

Linux has built-in network emulation capabilities, starting from kernel 2.6.7 (released in 2004). Linux allows us to manipulate traffic control settings using the tc tool, available in iproute2; netem is an extension (queueing discipline) of the tc tool. It allows emulation of network properties: delay, packet loss, packet reorder, duplication, corruption, and bandwidth rate.
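For intuition, here is roughly what a pumba netem delay boils down to under the hood: a netem queueing discipline applied with tc. This is a sketch; it assumes interface eth0 and root privileges inside the container’s network namespace, and the exact invocation may differ between Pumba versions.

```shell
# compose a netem spec: delay TIME JITTER CORRELATION%
DELAY_MS=3000; JITTER_MS=30; CORRELATION=20
NETEM_SPEC="delay ${DELAY_MS}ms ${JITTER_MS}ms ${CORRELATION}%"

# the tc commands Pumba would issue (shown with echo, since they need root to execute):
echo "tc qdisc add dev eth0 root netem ${NETEM_SPEC}"   # impose the delay
echo "tc qdisc del dev eth0 root netem"                 # restore normal traffic
```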

Pumba netem commands can help development teams simulate realistic network conditions as they build, ship, and run microservices in Docker containers.

Pumba wraps the low-level netem options and greatly simplifies their usage, making it easier to emulate different network properties for running Docker containers.

In the current release, Pumba modifies egress traffic only, adding delay or packet loss for the specified container(s). Target containers can be specified by name (a single name or a space-separated list) or via an RE2 regular expression. Pumba modifies container network conditions for a specified duration; after the set time interval, it restores normal network conditions. Pumba also restores the original connection on a graceful shutdown of the pumba process (Ctrl-C) or when the Pumba container is stopped with the docker stop command. An option is available to apply an IP range filter to the network emulation: with it, Pumba modifies outgoing traffic for the specified IP and leaves other outgoing traffic unchanged. Using this option, we can change network properties for specific inter-container connections, as well as for specific Docker networks, since each Docker network has its own IP range.
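For example, the IP filter option can be combined with delay like this (a sketch; the IP 172.17.0.2 and the container name mydb are hypothetical):

```shell
# delay only packets from `mydb` destined for one IP, leaving other egress traffic intact
TARGET_IP="172.17.0.2"
PUMBA_NETEM="pumba netem --duration 5m --target ${TARGET_IP} delay --time 2000 mydb"
echo "docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba ${PUMBA_NETEM}"
```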

Pumba delay: netem delay

To demonstrate, we’ll run two Docker containers: one running a ping command, and a Pumba container that adds a 3 second network delay to the ping container for 1 minute. After 1 minute, the Pumba container restores the network connection properties of the ping container and exits gracefully.

# open two terminal windows: (1) and (2)

# terminal (1)
# create new 'tryme' Alpine container (with iproute2) and ping `www.example.com`
$ docker run -it --rm --name tryme alpine sh -c "apk add --update iproute2 && ping www.example.com"

# terminal (2)
# run pumba: add 3s delay to `tryme` container for 1m
$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
         pumba netem --interface eth0 --duration 1m delay --time 3000 tryme

# See `ping` delay increased by 3000ms for 1 minute
# You can stop Pumba earlier with `Ctrl-C`

netem delay examples

This section contains more advanced network emulation examples for delay command.

# add 3 seconds delay for all outgoing packets on device `eth0` (default) of `mydb` Docker container for 5 minutes

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba netem --duration 5m \
      delay --time 3000 \
      mydb

# add a delay of 3000ms ± 30ms, with the next random element depending 20% on the last one,
# for all outgoing packets on device `eth1` of all Docker containers with names starting with `hp`,
# for 5 minutes

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba netem --duration 5m --interface eth1 \
      delay \
        --time 3000 \
        --jitter 30 \
        --correlation 20 \
      re2:^hp
# add a delay of 3000ms ± 40ms, where the variation in delay follows a `normal` distribution,
# for all outgoing packets on device `eth0` of one randomly chosen Docker container from the list,
# for 5 minutes

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba --random \
      netem --duration 5m \
        delay \
          --time 3000 \
          --jitter 40 \
          --distribution normal \
        container1 container2 container3

Pumba packet loss: netem loss, netem loss-state, netem loss-gemodel

Let’s start with a packet loss demo. Here we will run three Docker containers: an iperf server and client for sending data, and a Pumba container that will add packet loss on the client container. We are using iperf, a network throughput testing tool, to demonstrate packet loss.

# open three terminal windows

# terminal (1) iperf server
# server: `-s` run in server mode; `-u` use UDP;  `-i 1` report every second
$ docker run -it --rm --name tryme-srv alpine sh -c "apk add --update iperf && iperf -s -u -i 1"

# terminal (2) iperf client
# client: `-c <server ip>` connect to the server; `-u` use UDP
# get the server container IP first, e.g.: SERVER_IP=$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' tryme-srv)
$ docker run -it --rm --name tryme alpine sh -c "apk add --update iproute2 iperf && iperf -c ${SERVER_IP} -u"

# terminal (3)
# run pumba: add 20% packet loss to `tryme` container for 1m
$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
         pumba netem --duration 1m loss --percent 20 tryme

# See the server report in terminal (1): 'Lost/Total Datagrams' - you should see lost packets there

It is generally understood that packet loss distribution in IP networks is “bursty”. To simulate more realistic packet loss events, different probability models are used. Pumba currently supports three loss probability models for packet loss, and defines a separate loss command for each:

  • loss - independent probability loss model (Bernoulli model); the most widely used loss model, where packet losses are modeled by a random process consisting of Bernoulli trials
  • loss-state - 2-state, 3-state and 4-state Markov models
  • loss-gemodel - Gilbert and Gilbert-Elliott models

Papers on network packet loss models:

  • “Indepth: Packet Loss Burstiness” link
  • “Definition of a general and intuitive loss model for packet networks and its implementation in the Netem module in the Linux kernel.” link
  • man netem link

netem loss examples

# loss 0.3% of packets
# apply for `eth0` network interface (default) of `mydb` Docker container for 5 minutes

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba netem --duration 5m \
      loss --percent 0.3 \
      mydb

# loss 1.4% of packets (14 packets out of 1000 will be lost)
# each successive probability (of loss) depends by a quarter on the last one
#   Prob(n) = .25 * Prob(n-1) + .75 * Random
# apply on `eth1` network interface of Docker containers (names starting with `hp`) for 15 minutes

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba netem --interface eth1 --duration 15m \
      loss --percent 1.4 --correlation 25 \
      re2:^hp

# use 2-state Markov model for packet loss probability: P13=15%, P31=85%
# apply on `eth1` network interface of 3 Docker containers (c1, c2 and c3) for 12 minutes

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba netem --interface eth1 --duration 12m \
      loss-state -p13 15 -p31 85 \
      c1 c2 c3

# use Gilbert-Elliot model for packet loss probability: p=5%, r=90%, (1-h)=85%, (1-k)=7%
# apply on `eth2` network interface of `mydb` Docker container for 9 minutes and 30 seconds

$ docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba \
    pumba netem --interface eth2 --duration 9m30s \
      loss-gemodel --pg 5 --pb 90 --one-h 85 --one-k 7 \
      mydb


Special thanks to Neil Gehani for helping me with this post and to Inbar Shani for initial Pull Request with netem command.


To see more examples of how to use Pumba netem commands, please refer to the Pumba GitHub Repository. We have open sourced it and gladly accept ideas, pull requests, issues, and any other contributions.

Pumba can be downloaded as precompiled binary (Windows, Linux and MacOS) from the GitHub project release page. It’s also available as a Docker image.

Pumba GitHub Repository

16 Apr 2016, 20:00

Pumba - Chaos Testing for Docker

Update (27-07-27): Updated post to the latest Pumba version, v0.2.0.


The best defense against unexpected failures is to build resilient services. Testing for resiliency enables teams to learn where their apps fail before the customer does. By intentionally causing failures as part of resiliency testing, you can enforce your policy for building resilient systems. Resilience of a system can be defined as its ability to continue functioning even if some components of the system are failing. The growing popularity of distributed and microservice architectures makes resilience testing critical for applications that now require 24x7x365 operation. Resilience testing is an approach where you intentionally inject different types of failures at the infrastructure level (VM, network, containers, and processes) and let the system try to recover from these unexpected failures, just as can happen in production. Simulating realistic failures at any time is the best way to enforce highly available and resilient systems.

What is Pumba?


First of all, Pumba (or Pumbaa) is a supporting character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”. This reflects the unexpected behavior of the application.

Pumba is inspired by the highly popular Netflix Chaos Monkey resilience testing tool for the AWS cloud. Pumba takes a similar approach but applies it at the container level. It connects to the Docker daemon running on some machine (local or remote) and brings a level of chaos to it: “randomly” killing, stopping, and removing running containers.

If your system is designed to be resilient, it should be able to recover from such failures. “Failed” services should be restarted and lost connections should be recovered. This is not as trivial as it sounds. You need to design your services differently. Be aware that a service can fail (for whatever reason) or service it depends on can disappear at any point of time (but can reappear later). Expect the unexpected!

Why run Pumba?

Failures happen, and they inevitably happen when least desired. If your application cannot recover from system failures, you are going to face angry customers and maybe even lose them. If you want to be sure that your system is able to recover from unexpected failures, it is better to take charge and inject failures yourself instead of waiting for them to happen. And this is not a one-time effort. In the age of Continuous Delivery, you need to be sure that every change to any of the system’s services does not compromise system availability. That’s why you should practice continuous resilience testing. Docker is gaining popularity, and people are deploying and running clusters of containers in production. Using a container orchestration framework (e.g. Kubernetes, Swarm, CoreOS fleet), it’s possible to restart a “failed” container automatically. But how can you be sure that restarted services and other system services can properly recover from failures? If you are not using container orchestration frameworks, life is even harder: you will need to handle container restarts yourself.

This is where Pumba shines. You can run it on every Docker host, in your cluster, and Pumba will “randomly” stop running containers - matching specified name/s or name patterns. You can even specify the signal that will be sent to “kill” the container.

What can Pumba do?

Pumba can create different failures for your running Docker containers. Pumba can kill, stop or remove running containers. It can also pause all processes within a running container for a specified period of time. Pumba can also do network emulation, simulating different network failures, like delay, packet loss/corruption/reorder, bandwidth limits, and more. Disclaimer: the netem command is under development and only the delay command is supported in Pumba v0.2.0.

You can pass a list of containers to Pumba or just write a regular expression to select matching containers. If you do not specify containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list.

How to run Pumba?

There are two ways to run Pumba.

First, you can download the Pumba application (a single binary file) for your OS from the project release page and run pumba help to see the list of supported commands and options.

$ pumba help

Pumba version v0.2.0
   Pumba - Pumba is a resilience testing tool, that helps applications tolerate random Docker container failures: process, network and performance.

   pumba [global options] command [command options] containers (name, list of names, RE2 regex)


     kill     kill specified containers
     netem    emulate the properties of wide area networks
     pause    pause all processes
     stop     stop containers
     rm       remove containers
     help, h  Shows a list of commands or help for one command

   --host value, -H value      daemon socket to connect to (default: "unix:///var/run/docker.sock") [$DOCKER_HOST]
   --tls                       use TLS; implied by --tlsverify
   --tlsverify                 use TLS and verify the remote [$DOCKER_TLS_VERIFY]
   --tlscacert value           trust certs signed only by this CA (default: "/etc/ssl/docker/ca.pem")
   --tlscert value             client certificate for TLS authentication (default: "/etc/ssl/docker/cert.pem")
   --tlskey value              client key for TLS authentication (default: "/etc/ssl/docker/key.pem")
   --debug                     enable debug mode with verbose logging
   --json                      produce log in JSON format: Logstash and Splunk friendly
   --slackhook value           web hook url; send Pumba log events to Slack
   --slackchannel value        Slack channel (default #pumba) (default: "#pumba")
   --interval value, -i value  recurrent interval for chaos command; use with optional unit suffix: 'ms/s/m/h'
   --random, -r                randomly select single matching container from list of target containers
   --dry                       dry run; does not create chaos, only logs planned chaos commands
   --help, -h                  show help
   --version, -v               print the version

Kill Container command

$ pumba kill -h

   pumba kill - kill specified containers

   pumba kill [command options] containers (name, list of names, RE2 regex)

   send termination signal to the main process inside target container(s)

   --signal value, -s value  termination signal, that will be sent by Pumba to the main process inside target container(s) (default: "SIGKILL")

Pause Container command

$ pumba pause -h

   pumba pause - pause all processes

   pumba pause [command options] containers (name, list of names, RE2 regex)

   pause all running processes within target containers

   --duration value, -d value  pause duration: should be smaller than recurrent interval; use with optional unit suffix: 'ms/s/m/h'

Stop Container command

$ pumba stop -h
   pumba stop - stop containers

   pumba stop [command options] containers (name, list of names, RE2 regex)

   stop the main process inside target containers, sending  SIGTERM, and then SIGKILL after a grace period

   --time value, -t value  seconds to wait for stop before killing container (default: 10)

Remove (rm) Container command

$ pumba rm -h

   pumba rm - remove containers

   pumba rm [command options] containers (name, list of names, RE2 regex)

   remove target containers, with links and volumes

   --force, -f    force the removal of a running container (with SIGKILL)
   --links, -l    remove container links
   --volumes, -v  remove volumes associated with the container

Network Emulation (netem) command

$ pumba netem -h

   Pumba netem - delay, loss, duplicate and re-order (run 'netem') packets, to emulate different network problems

   Pumba netem command [command options] [arguments...]

     delay      delay egress traffic
     loss       TODO: planned to implement ...
     duplicate  TODO: planned to implement ...
     corrupt    TODO: planned to implement ...

   --duration value, -d value   network emulation duration; should be smaller than recurrent interval; use with optional unit suffix: 'ms/s/m/h'
   --interface value, -i value  network interface to apply delay on (default: "eth0")
   --target value, -t value     target IP filter; netem will impact only on traffic to target IP
   --help, -h                   show help

Network Emulation Delay sub-command

$ pumba netem delay -h

   Pumba netem delay - delay egress traffic

   Pumba netem delay [command options] containers (name, list of names, RE2 regex)

   delay egress traffic for specified containers; networks show variability, so it is possible to add a random variation; delay variation isn't purely random, so to emulate that there is a correlation value as well

   --amount value, -a value       delay amount; in milliseconds (default: 100)
   --variation value, -v value    random delay variation; in milliseconds; example: 100ms ± 10ms (default: 10)
   --correlation value, -c value  delay correlation; in percents (default: 20)


# stop a random container (send SIGSTOP) once every 10 minutes
$ ./pumba --random --interval 10m kill --signal SIGSTOP
# every 15 minutes kill `mysql` container and every hour remove containers starting with "hp"
$ ./pumba --interval 15m kill --signal SIGTERM mysql &
$ ./pumba --interval 1h rm re2:^hp &
# every 30 seconds kill "worker1" and "worker2" containers and every 3 minutes stop "queue" container
$ ./pumba --interval 30s kill --signal SIGKILL worker1 worker2 &
$ ./pumba --interval 3m stop queue &
# Once every 5 minutes, Pumba will delay egress traffic for 2 seconds (2000ms) for some (randomly chosen) container,
# named `result...` (matching `^result` regexp) on `eth2` network interface.
# Pumba will restore normal connectivity after 2 minutes. Print debug trace to STDOUT too.
$ ./pumba --debug --interval 5m --random netem --duration 2m --interface eth2 delay --amount 2000 re2:^result

Running Pumba in Docker Container

The second approach is to run Pumba in a Docker container.

In order to give Pumba access to the Docker daemon on the host machine, you will need to mount the /var/run/docker.sock unix socket.

# run latest stable Pumba docker image (from master repository)
$ docker run -d -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba:master pumba --interval 10s kill --signal SIGTERM re2:^hp

Pumba will not kill its own container.

Note: For Mac OSX - before you run Pumba, you may want to do the following after downloading the pumba_darwin_amd64 binary:

chmod +x pumba_darwin_amd64
mv pumba_darwin_amd64 /usr/local/bin/pumba


The Pumba project is available for you to try out. We will gladly accept ideas, pull requests, issues, and contributions to the project.

Pumba GitHub Repository

07 Mar 2016, 16:39

Testing Strategies for Docker Containers

Congratulations! You know how to build a Docker image and are able to compose multiple containers into a meaningful application. Hopefully, you’ve already created a Continuous Delivery pipeline and know how to push your newly created image into production or testing environment.

Now, the question is - How do we test our Docker containers?

There are multiple testing strategies we can apply. In this post, I’ll highlight them presenting benefits and drawbacks for each.

The “Naive” approach

This is the default approach for most people. It relies on a CI server to do the job. When taking this approach, the developer uses Docker as a package manager, a better option than the jar/rpm/deb approach. The CI server compiles the application code and executes tests (unit, service, functional, and others). The build artifacts are reused in the Docker build to produce a new image, which becomes the core deployment artifact. The produced image contains not only the application “binaries”, but also the required runtime, including all dependencies and application configuration.
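Condensed into a script, the naive flow looks roughly like this (a sketch; the Gradle build tool, the image name myapp, and the BUILD_NUMBER variable are placeholders, not from any particular setup):

```shell
# naive CI flow: build and test on the CI host, then bake the artifacts into an image
naive_ci() {
  ./gradlew build test                               # compile and run tests with host-installed tooling
  docker build -t myapp:"${BUILD_NUMBER:-dev}" . && \
    docker push myapp:"${BUILD_NUMBER:-dev}"         # the image becomes the deployment artifact
}
```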

We gain application portability; however, we lose development and testing portability. We’re not able to reproduce exactly the same development and testing environment outside the CI. To create a new test environment we’d need to set up the testing tools (correct versions and plugins), configure runtime and OS settings, and get the same versions of test scripts and perhaps test data.

The Naive Testing Strategy

Resolving these problems leads us to the next approach.

App & Test Container approach

Here, we try to create a single bundle with the application “binaries”, required packages, testing tools (specific versions), testing tool plugins, test scripts, and a test environment with all required packages.

The benefits of this approach:

  • We have a repeatable test environment - we can run exactly the same tests using the same testing tools - in our CI, development, staging, or production environment
  • We capture test scripts at a specific point in time so we can always reproduce them in any environment
  • We do not need to setup and configure our testing tools - they are part of our image

This approach has significant drawbacks:

  • Increases the image size - because it contains testing tools, required packages, test scripts, and perhaps even test data
  • Pollutes image runtime environment with test specific configuration and may even introduce an unneeded dependency (required by integration testing)
  • We also need to decide what to do with the test results and logs; how and where to export them

Here’s a simplified Dockerfile. It illustrates this approach.

FROM "<base image>":"<version>"

WORKDIR "<path>"

# install packages required to run app and tests
RUN apt-get update && apt-get install -y \
    "<app runtime> and <dependencies>" \
    "<test tools> and <dependencies>" \
    && rm -rf /var/lib/apt/lists/*

# copy app files
COPY app app
COPY run.sh run.sh

# copy test scripts
COPY tests tests

# copy "main" test command
COPY test.sh test.sh

# ... EXPOSE, RUN, ADD ... for app and test environment

# main app command
CMD ["run.sh", "<app arguments>"]

# it's not possible to have multiple CMD commands, but this is the "main" test command
# CMD ["/test.sh", "<test arguments>"]


App & Test Container

There has to be a better way for in-container testing and there is.

Test Aware Container Approach

Today, Docker’s promise is “Build -> Ship -> Run” - build the image, ship it to some registry, and run it anywhere. IMHO there’s a critical missing step - Test. The right and complete sequence should be: Build -> Test -> Ship -> Run.

Let’s look at a “test-friendly” Dockerfile syntax and extensions to Docker commands. This important step could be supported natively. It’s not a real syntax, but bear with me. I’ll define the “ideal” version and show how to implement something that’s very close.


Let’s define a special ONTEST instruction, similar to existing ONBUILD instruction. The ONTEST instruction adds a trigger instruction to the image to be executed at a later time when the image is tested. Any build instruction can be registered as a trigger.

The ONTEST instruction should be recognized by a new docker test command.

docker test [OPTIONS] IMAGE [COMMAND] [ARG...]

The docker test command syntax will be similar to the docker run command, with one significant difference: a new “testable” image will be automatically generated and tagged with the <image name>:<image tag>-test tag (a “test” postfix added to the original image tag). This “testable” image will be generated FROM the application image, executing all build instructions defined after the ONTEST command, and executing ONTEST CMD (or ONTEST ENTRYPOINT). The docker test command should return a non-zero code if any tests fail. The test results should be written into an automatically generated VOLUME that points to the /var/tests/results folder.

Let’s look at a modified Dockerfile below - it includes the new proposed ONTEST instruction.

FROM "<base image>":"<version>"

WORKDIR "<path>"

# install packages required to run app
RUN apt-get update && apt-get install -y \
    "<app runtime> and <dependencies>" \
    && rm -rf /var/lib/apt/lists/*

# install packages required to run tests   
ONTEST RUN apt-get update && apt-get install -y \
           "<test tools> and <dependencies>" \
           && rm -rf /var/lib/apt/lists/*

# copy app files
COPY app app
COPY run.sh run.sh

# copy test scripts
ONTEST COPY tests tests

# copy "main" test command
ONTEST COPY test.sh test.sh

# auto-generated volume for test results
# ONTEST VOLUME "/var/tests/results"

# ... EXPOSE, RUN, ADD ... for app and test environment

# main app command
CMD ["run.sh", "<app arguments>"]

# main test command
ONTEST CMD ["/test.sh", "<test arguments>"]


Test Aware Container

Making “Test Aware Container” Real

We believe Docker should make docker-test part of the container management lifecycle. There is a need to have a simple working solution today and I’ll describe one that’s very close to the ideal state.

As mentioned before, Docker has a very useful ONBUILD instruction. This instruction allows us to trigger another build instruction on succeeding builds. The basic idea is to use the ONBUILD instruction when running the docker-test command.

The flow executed by docker-test command:

  1. docker-test will search for ONBUILD instructions in application Dockerfile and will …
  2. generate a temporary Dockerfile.test from original Dockerfile
  3. execute docker build -f Dockerfile.test [OPTIONS] PATH, with additional options supported by the docker build command; a -test suffix will be automatically appended to the tag option
  4. If build is successful, execute docker run -v ./tests/results:/var/tests/results [OPTIONS] IMAGE:TAG-test [COMMAND] [ARG...]
  5. Remove Dockerfile.test file

Why not create a new Dockerfile.test without requiring the ONBUILD instruction?

Because in order to test the right image (and tag), we’d need to keep FROM always updated to the image:tag that we want to test. This is not trivial.

There is a limitation to the described approach: it’s not suitable for “onbuild” images (images used to automatically build your app), like maven:onbuild.

Let’s look at a simple implementation of docker-test command. It highlights the concept: the docker-test command should be able to handle build and run command options and be able to handle errors properly.


# `image` and `tag` refer to the application image under test
echo "FROM ${image}:${tag}" > Dockerfile.test &&
docker build -t "${image}:${tag}-test" -f Dockerfile.test . &&
docker run -it --rm -v "$(pwd)/tests/results:/var/tests/results" "${image}:${tag}-test" &&
rm Dockerfile.test
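The same flow can be wrapped in a small reusable function with cleanup on failure (a sketch; a real docker-test command would also forward build and run options):

```shell
# docker_test <image> [tag]: build and run a "testable" image, always cleaning up Dockerfile.test
docker_test() {
  image="$1"; tag="${2:-latest}"
  echo "FROM ${image}:${tag}" > Dockerfile.test
  docker build -t "${image}:${tag}-test" -f Dockerfile.test . &&
    docker run --rm -v "$(pwd)/tests/results:/var/tests/results" "${image}:${tag}-test"
  status=$?
  rm -f Dockerfile.test    # remove the temporary Dockerfile even if build or run failed
  return $status
}
```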

Let’s focus on the most interesting and relevant part.

Integration Test Container

Let’s say we have an application built from tens or hundreds of microservices. Let’s say we have an automated CI/CD pipeline, where each microservice is built and tested by our CI and deployed into some environment (testing, staging or production) after the build and tests pass. Pretty cool, eh? Our CI tests are capable of testing each microservice in isolation - running unit and service tests (or API contract tests). Maybe even micro-integration tests - tests run on a subsystem created in an ad-hoc manner (for example, with docker-compose’s help).

This leads to some issues that we need to address:

  • What about real integration tests or long running tests (like performance and stress)?
  • What about resilience tests (“chaos monkey” like tests)?
  • Security scans?
  • What about test and scan activities that take time and should be run on a fully operational system?

There should be a better way than just dropping a new microservice version into production and tightly monitoring it for a while.

There should be a special Integration Test Container. These containers will contain only testing tools and test artifacts: test scripts, test data, test environment configuration, etc. To simplify orchestration and automation of such containers, we should define and follow some conventions and use metadata labels (Dockerfile LABEL instruction).

Integration Test Labels

  • test.type - test type; default integration; can be one of: integration, performance, security, chaos or any text; presence of this label states that this is an Integration Test Container
  • test.results - VOLUME for test results; default /var/tests/results
  • test.XXX - any other test related metadata; just use test. prefix for label name

Integration Test Container

The Integration Test Container is just a regular Docker container. It does not contain any application logic or code. Its sole purpose is to create repeatable and portable testing. Recommended content of the Integration Test Container:

  • The Testing Tool - Phantom.js, Selenium, Chakram, Gatling, …
  • Testing Tool Runtime - Node.js, JVM, Python, Ruby, …
  • Test Environment Configuration - environment variables, config files, bootstrap scripts, …
  • Tests - as compiled packages or script files
  • Test Data - any kind of data files, used by tests: json, csv, txt, xml, …
  • Test Startup Script - some “main” startup script to run tests; just create test.sh and launch the testing tool from it.
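
Putting these conventions together, a minimal Integration Test Container image might look like this (a sketch; the testing tool, runtime and paths are illustrative choices, not a prescribed layout):

```dockerfile
FROM node:4

# convention labels, so orchestration tooling can discover this as an Integration Test Container
LABEL test.type="integration"
LABEL test.results="/var/tests/results"

# testing tool and plugins (Chakram API testing framework on Node.js, as an example)
RUN npm install -g mocha chakram

# test scripts, test data and environment configuration
COPY tests /tests
COPY test-data /test-data
COPY test.sh /test.sh

# results volume, matching the test.results label above
VOLUME /var/tests/results

# "main" test startup script
CMD ["/test.sh"]
```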

Integration Test Containers should run in an operational environment where all microservices are deployed: testing, staging or production. These containers can be deployed exactly like all other services. They use the same network layer and can thus access multiple services, using the selected service discovery method (usually DNS). Accessing multiple services is required for real integration testing: we need to simulate and validate how our system works across multiple places.

Keeping integration tests inside some application service container not only increases the container footprint but also creates an unneeded dependency between multiple services. We keep all these dependencies at the level of the Integration Test Container. Once our tests (and testing tools) are packaged inside the container, we can always rerun the same tests in any environment, including the developer machine. You can always go back in time and rerun a specific version of the Integration Test Container.


WDYT? Your feedback, particularly on standardizing the docker-test command, is greatly appreciated.

14 Jan 2016, 16:14

Docker Pattern: The Build Container

Let’s say that you’re developing a microservice in a compiled language, or an interpreted language that requires some additional “build” steps to package and lint your application code. This post describes a useful Docker pattern for such cases: the “build” container.

In our project, we’re using Docker as our main deployment package: every microservice is delivered as a Docker image. Each microservice also has its own code repository (GitHub), and its own CI build job.

Some of our services are written in Java. Java code requires additional tooling and processing before you get working code. This tooling and all associated packages are build-time dependencies and are not required when the compiled program is running.

We have been trying to develop a repeatable process and uniform environment for deploying our services. We believe it’s good to package Java tools and packages into containers too. It allows us to build a Java-based microservice on any machine, including CI, without any specific environment requirements: JDK version, profiling and testing tools, OS, Maven, environment variables, etc.

For every service, we have two Dockerfiles: one for the service runtime and a second one packed with the tools required to build the service. We usually name these files Dockerfile and Dockerfile.build. We use the -f, --file="" flag to specify the Dockerfile name for the docker build command.

Here is our Dockerfile.build file:

FROM maven:3.3.3-jdk-8

ENV GAIA_HOME=/usr/local/gaia/

RUN mkdir -p $GAIA_HOME
WORKDIR $GAIA_HOME

# speed up maven build, read https://keyholesoftware.com/2015/01/05/caching-for-maven-docker-builds/
# selectively add the POM file
ADD pom.xml $GAIA_HOME

# get all the downloads out of the way
RUN ["mvn","verify","clean","--fail-never"]

# add source
ADD . $GAIA_HOME

# run maven verify
RUN ["mvn","verify"]

As you can see, it’s a simple file with one little trick to speed up our Maven build.

Now, we have all the tools required to compile our service. We can run this Docker container on any machine without requiring a JDK to be installed. We can run the same Docker container on a developer’s laptop and on our CI server.

This trick greatly simplifies our CI process - we no longer require our CI to support any specific compiler, version, or tool. All we need is the Docker engine. Everything else, we bring ourselves.

BYOT - Bring Your Own Tooling! :-)

In order to compile the service code, we need to build and run the builder container.

docker build -t build-img -f Dockerfile.build .
docker create --name build-cont build-img

Once we’ve built the image and created a new container from it, we have our service compiled inside the container. The only remaining task is to extract the build artifacts from the container. We could use Docker volumes: this is one possible option. But we actually like the fact that the image we’ve created contains all the tools and build artifacts inside it. It allows us to get any build artifact from this image at any time, just by creating a new container from it.

To extract our build artifacts, we are using docker cp command. This command copies files from the container to the local file system.

Here is how we are using this command:

docker cp build-cont:/usr/local/gaia/target/mgs.war ./target/mgs.war

As a result, we have the compiled service code, packaged into a single WAR file. We get exactly the same WAR file on any machine, just by running our builder container, or by rebuilding the builder container against the same code commit (using a Git tag or specific commit ID) on any machine.

We can now create a Docker image with our service and required runtime, which is usually some version of JRE and servlet container.

Here is our Dockerfile for the service. It’s an image with Jetty on JRE 8 and our service WAR file.

FROM jetty:9.3.0-jre8

RUN mkdir -p /home/jetty && chown jetty:jetty /home/jetty

COPY ./target/*.war $JETTY_BASE/webapps/

By running docker build ., we have a new image with our newly “built” service.

The Recipe:

  • Have one Dockerfile with all tools and packages required to build your service. Name it Dockerfile.build or give it a name you like.
  • Have another Dockerfile with all packages required to run your service.
  • Keep both files with your service code.
  • Build a new builder image, create a container from it and extract build artifacts, using volumes or docker cp command.
  • Build the service image.
  • That’s all folks!


In our case of a Java-based service, the difference between the builder container and the service container is huge. The JDK is a much bigger package than the JRE; it also requires all X Window packages to be installed inside your container. For the runtime, you can have a slim image with your service code, a JRE, and some basic Linux packages, or even start from scratch.

30 Dec 2015, 11:09

Docker Pattern: Deploy and update dockerized application on cluster

Docker is a great technology that simplifies the development and deployment of distributed applications.

While Docker serves as a core technology, many issues remain to be solved. We find ourselves struggling with some of these issues. For example:

  • How to create an elastic Docker cluster?
  • How to deploy and connect multiple containers together?
  • How to build CI/CD process?
  • How to register and discover system services and more?

For most of these, there are open source projects or services available from the community, as well as commercial offerings, including from Docker, Inc.

In this post, I would like to address one such problem:

How to automate Continuous Delivery for a Docker cluster?

In our application, every service is packaged as a Docker image. The Docker image is our basic deployment unit. In a runtime environment (production, testing and others), we may have multiple containers created from this image. For each service, we have a separate Git repository and one or more Dockerfile(s). We use these for building, testing, and packaging; I will explain our project structure in the next blog post.

We’ve set up an automated Continuous Integration (CI) process, using CircleCI. For every push to some branch of the service Git repository, the CI service triggers a new Docker build process and creates a new Docker image for the service. As part of the build process, if everything compiles and all unit and component tests pass, we push the newly created Docker image to our DockerHub repository.

At the end of CI process, we have a newly tested and ‘kosher’ Docker image in our DockerHub repository. Very nice indeed! However, we are left with several questions:

  • How to perform a rolling update of a modified service?
  • What is the target environment (i.e. some Docker cluster)?
  • How to find the IP of a dynamically created CoreOS host (we use AWS auto-scale groups)?
  • How to connect to it in a secure way without the need to expose the SSH keys or cloud credentials (we use AWS for infrastructure)?

Our application runs on a CoreOS cluster. We have automation scripts that create the CoreOS cluster in multiple environments: a developer machine, AWS, or some VM infrastructure. When we have a new service version (i.e. a new Docker image), we need a way to deploy this service to the cluster. CoreOS uses fleet for cluster orchestration, and fleetctl is a command line client that talks to the fleet backend and allows you to manage the CoreOS cluster. However, this only works in a local environment (machine or network). For some commands fleetctl uses the HTTP API, and for others an SSH connection. The HTTP API is not protected, so it makes no sense to expose it from your cloud environment. The SSH connection does not work when your architecture assumes there is no direct SSH connection to the cluster: connecting through some SSH bastion machine, requiring the use of multiple SSH keys, and creating SSH configuration files are not supported by the fleetctl program. And I have no desire to store SSH keys or my cloud credentials for my production environment on any CI server, due to security concerns.

So, what do we do?

First, we want to deploy a new service or update some service when there is a new Docker image or image version available. We also want to be selective and pick images created from code in a specific branch or even specific build.

The basic idea is to create a Docker image that captures the system configuration. This image stores the configuration of all services at a specific point in time, from specific branches. A container created from this image should be run from within the target cluster. Besides the captured configuration, the image also contains the deployment tool (fleetctl in our case), plus some code that detects which services need to be updated, deleted, or installed as new services.

This idea leads to another question:

How do you define and capture system configuration?

In CoreOS, every service can be described by a systemd unit file. This is a plain text file that describes how and when to launch your service. I’m not going to explain how to write such files; there is a lot of info online. What’s important in our case is that the service systemd unit file contains a docker run command with parameters and the name of the image that needs to be downloaded and executed. We keep the service systemd unit file in the same repository as the service code.

The image name is usually defined as a repository/image_name:tag string. The tag is the most important part of our solution. As explained above, our CI server automatically builds a new Docker image on every push of the service to the GitHub repository. The CI job also tags the newly created image with two tags:

  1. branch tag - taken from the GitHub branch that triggered the build (master, develop, or feature-* branches in our case)
  2. build_num-branch tag - the same branch name, prefixed with the running build number

As a result, in DockerHub, we have an image for the latest build in any branch, and for every image we can identify the build job number and the branch it was created from.
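In CircleCI terms, the tagging step could be sketched as below. CIRCLE_BRANCH and CIRCLE_BUILD_NUM are provided by CircleCI; the defaults, the image name, and the commented docker commands are illustrative only:

```shell
#!/bin/sh
# Sketch of the CI tagging step (defaults shown only for illustration).
BRANCH="${CIRCLE_BRANCH:-develop}"
BUILD_NUM="${CIRCLE_BUILD_NUM:-58}"
IMAGE="myrepo/myservice"

# the two tags described above
BRANCH_TAG="${IMAGE}:${BRANCH}"
BUILD_TAG="${IMAGE}:${BUILD_NUM}-${BRANCH}"

echo "$BRANCH_TAG"
echo "$BUILD_TAG"

# with a Docker daemon available, the CI job would then run:
#   docker tag "$IMAGE" "$BRANCH_TAG" && docker push "$BRANCH_TAG"
#   docker tag "$IMAGE" "$BUILD_TAG"  && docker push "$BUILD_TAG"
```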

As I said before, we keep the service systemd unit file in the same repository as the code. This file does not contain an image tag, only the repository and image name. See the example below:

ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice

Our CI build job generates a new service systemd unit file for every successful build, replacing the above service invocation command with one that also contains the new tag, using the build_num-branch pattern (the develop branch in our example). We use two simple utilities for this job: cat and sed. It’s possible to use a more advanced template engine.

ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice:58-develop
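The sed step can be sketched as follows. The file names are illustrative, and a real CI job would take the tag value from its build metadata instead of hard-coding it:

```shell
#!/bin/sh
TAG="58-develop"

# the unit file as stored in the service repository (no image tag)
cat > myservice.service <<'EOF'
[Service]
ExecStart=/usr/bin/docker run --name myservice -p 3000:3000 myrepo/myservice
EOF

# append the tag to the image name at the end of the line
# and write the generated unit file
sed "s|myrepo/myservice$|myrepo/myservice:${TAG}|" \
  myservice.service > myservice.generated.service

cat myservice.generated.service
```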

As a last step, the CI build job “deploys” this new unit file to the system configuration Git repository.

git commit -am "Updating myservice tag to '58-develop'" myservice.service
git push origin develop

Another CI job that monitors changes in the system configuration repository will trigger a build and create a new Docker image that captures updated system configuration.

All we need to do now is to execute this Docker image on the Docker cluster. Something like this:

docker run -it --rm=true myrepo/mysysconfig:develop

Our system configuration Dockerfile:

FROM alpine:3.2
MAINTAINER Alexei Ledenev <alexei.led@gmail.com>

# fleetctl version to install (0.11.5 was current at the time of writing)
ENV FLEET_VERSION 0.11.5

ENV gaia /home/gaia
RUN mkdir -p ${gaia}

WORKDIR ${gaia}
COPY *.service ./
COPY deploy.sh ./
RUN chmod +x deploy.sh

# install packages and fleetctl client
RUN apk update && \
 apk add curl openssl bash && \
 rm -rf /var/cache/apk/* && \
 curl -L "https://github.com/coreos/fleet/releases/download/v${FLEET_VERSION}/fleet-v${FLEET_VERSION}-linux-amd64.tar.gz" | tar xz && \
 mv fleet-v${FLEET_VERSION}-linux-amd64/fleetctl /usr/local/bin/ && \
 rm -rf fleet-v${FLEET_VERSION}-linux-amd64

CMD ["/bin/bash", "deploy.sh"]

deploy.sh is a shell script that checks every service: if a service needs to be updated, created, or deleted, the script executes the corresponding fleetctl command.
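A stripped-down, dry-run sketch of such a deploy.sh is shown below. It only prints the fleetctl commands it would run, since fleetctl needs a live cluster; a real version would diff the captured unit files against fleetctl list-units output and act only on changed services. The sample unit file is illustrative:

```shell
#!/bin/sh
set -e

# captured configuration: one sample unit file (illustrative)
mkdir -p sysconfig
cat > sysconfig/myservice.service <<'EOF'
[Service]
ExecStart=/usr/bin/docker run --name myservice myrepo/myservice:58-develop
EOF

# for every captured unit, (re)deploy it to the cluster;
# commands are printed (and logged) here instead of executed
for unit in sysconfig/*.service; do
  name=$(basename "$unit")
  echo "would run: fleetctl destroy $name" | tee -a deploy.log
  echo "would run: fleetctl start $unit" | tee -a deploy.log
done
```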

The final step:

How do you run this “system configuration” container?

Currently, in our environment, we do this manually (from an SSH shell) for development clusters, and use systemd timers for CoreOS clusters on AWS. A systemd timer allows us to define a cron-like job at the CoreOS cluster level.
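On CoreOS, such a timer is a pair of systemd units. Here is a sketch, with hypothetical unit names and a 15-minute schedule:

```ini
# sysconfig-update.service (hypothetical name)
[Unit]
Description=Run the system configuration container

[Service]
Type=oneshot
ExecStartPre=-/usr/bin/docker pull myrepo/mysysconfig:develop
ExecStart=/usr/bin/docker run --rm myrepo/mysysconfig:develop

# sysconfig-update.timer
[Unit]
Description=Periodically run the system configuration container

[Timer]
OnCalendar=*:0/15

[Install]
WantedBy=timers.target
```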

We have plans to define a WebHook endpoint that will allow us to trigger deployment/update based on a WebHook event from the CI service.

Hope you find this post useful. I look forward to your comments and any questions you have.