27 Jul 2019, 10:00

Get a Shell to a Kubernetes Node

Linux Shell

Throughout the lifecycle of your Kubernetes cluster, you may need to access a cluster worker node. This access could be for maintenance, configuration inspection, log collection, or other troubleshooting operations. Ideally, you would be able to enable this access whenever it's needed and disable it when you finish your task.

SSH Approach

While it's possible to configure Kubernetes nodes with SSH access, this also makes worker nodes more vulnerable. Using SSH requires a network connection between the engineer's machine and the EC2 instance, something you may want to avoid. Some users set up a jump server (also called a bastion host) as a typical pattern to minimize the attack surface from the Internet. But this approach still requires you to manage access to the bastion servers and protect SSH keys. IMHO, managing supporting SSH infrastructure is a high price to pay, especially if you just want shell access to a worker node or to run a few commands.

Kubernetes Approach

The Kubernetes command line tool, kubectl, allows you to run different commands against a Kubernetes cluster. You can manipulate Kubernetes API objects, manage worker nodes, inspect the cluster, execute commands inside a running container, and get an interactive shell to a running container.

Suppose you have a pod, named shell-demo. To get a shell to the running container on this pod, just run:

kubectl exec -it shell-demo -- /bin/bash

# see shell prompt ...
root@shell-demo:/#

How Does exec Work?

kubectl exec invokes the Kubernetes API Server, which "asks" the kubelet "node agent" to run an exec command against the CRI (Container Runtime Interface), most frequently the Docker runtime.

The docker exec API/command creates a new process, sets its namespaces to the target container's namespaces and then executes the requested command, also handling the input and output streams for the created process.
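
For comparison, if you were already logged in to the worker node (assuming a Docker runtime), the rough equivalent would be:

# find the container backing the shell-demo pod, then exec into it
docker ps | grep shell-demo
docker exec -it <container-id> /bin/bash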

The Idea

A Linux system starts out with a single namespace of each type (mount, process, ipc, network, UTS, and user), used by all processes.

So, all we need to do is run a new pod and connect it to the worker node's host namespaces.
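
For example, on a Linux host you can inspect the namespaces of PID 1 (the host init process) like this:

# list the namespace links of the host init process (requires root)
sudo ls -l /proc/1/ns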

A Helper Program

It is possible to use any Docker image with a shell on board as a "host shell" container. There is one limitation you should be aware of: it's not possible to join the mount namespace of the target container (or host).

nsenter is a small program from the util-linux package that can run a program in the namespaces (and cgroups) of other processes. Exactly what we need!

Most Linux distros ship with an outdated version of util-linux. So, I prepared the alexeiled/nsenter Docker image with the nsenter program on board. This is a super small Docker image, about 900KB in size, created from the scratch image and a single statically linked nsenter binary (v2.34).
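
If you want to try this image locally with plain Docker (outside Kubernetes), a rough sketch would look like the following; it assumes the /nsenter binary path used in the pod spec below:

docker run -it --rm --privileged --pid=host \
  --entrypoint /nsenter \
  alexeiled/nsenter:2.34 --all --target=1 -- su -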

Use the helper script below, also available in the alexei-led/nsenter GitHub repository, to run a new nsenter pod on a specified Kubernetes worker node. This helper script creates a privileged nsenter pod in the host's process and network namespaces, running nsenter with the --all flag, joining all namespaces and cgroups and running a default shell as a superuser (with the su - command).

The nodeSelector makes it possible to specify the target Kubernetes node to run the nsenter pod on. The "tolerations": [{"operator": "Exists"}] parameter helps to match any node taint, if specified.

Helper script

# get cluster nodes
kubectl get nodes

# output
NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-151-104.us-west-2.compute.internal   Ready    <none>   8d      v1.13.7-eks-c57ff8
ip-192-168-171-140.us-west-2.compute.internal   Ready    <none>   7d11h   v1.13.7-eks-c57ff8

# open superuser shell on specified node
./nsenter-node.sh ip-192-168-151-104.us-west-2.compute.internal

# prompt
[root@ip-192-168-151-104 ~]#

# pod will be destroyed on exit
...
nsenter-node.sh
#!/bin/sh
set -x

node=${1}
nodeName=$(kubectl get node ${node} -o template --template='{{index .metadata.labels "kubernetes.io/hostname"}}') 
nodeSelector='"nodeSelector": { "kubernetes.io/hostname": "'${nodeName:?}'" },'
podName=${USER}-nsenter-${node}

kubectl run ${podName:?} --restart=Never -it --rm --image overriden --overrides '
{
  "spec": {
    "hostPID": true,
    "hostNetwork": true,
    '"${nodeSelector?}"'
    "tolerations": [{
        "operator": "Exists"
    }],
    "containers": [
      {
        "name": "nsenter",
        "image": "alexeiled/nsenter:2.34",
        "command": [
          "/nsenter", "--all", "--target=1", "--", "su", "-"
        ],
        "stdin": true,
        "tty": true,
        "securityContext": {
          "privileged": true
        }
      }
    ]
  }
}' --attach "$@"

Management of Kubernetes worker nodes on AWS

When running a Kubernetes cluster on AWS, either Amazon EKS or a self-managed Kubernetes cluster, it is possible to manage Kubernetes nodes with [AWS Systems Manager](https://aws.amazon.com/systems-manager/). Using AWS Systems Manager (AWS SSM), you can automate multiple management tasks, apply patches and updates, run commands, and access a shell on any managed node, without the need to maintain SSH infrastructure.

In order to manage a Kubernetes node (AWS EC2 host), you need to install and start an SSM Agent daemon; see the AWS documentation for more details.

But we are taking a Kubernetes approach, and this means we are going to run an SSM Agent as a daemonset on every Kubernetes node in the cluster. This approach lets you run an up-to-date SSM Agent version without installing it on the host machine, and only when it's needed.

Prerequisite

First, you need to attach the AmazonEC2RoleforSSM policy to the Kubernetes worker nodes instance role. Without this policy, you won't be able to manage Kubernetes worker nodes with AWS SSM.
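
A hedged example of attaching the policy with the AWS CLI (the worker node instance role name is a placeholder):

aws iam attach-role-policy \
  --role-name <eks-worker-node-instance-role> \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM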

Setup

Then, clone the alexei-led/kube-ssm-agent GitHub repository. It contains a properly configured SSM Agent daemonset file.

The daemonset uses the alexeiled/aws-ssm-agent:<ver> Docker image that contains:

  1. AWS SSM Agent, the same version as Docker image tag
  2. Docker CLI client
  3. AWS CLI client
  4. Vim and additional useful programs

Run to deploy a new SSM Agent daemonset:

kubectl create -f daemonset.yaml

Once the SSM Agent daemonset is running, you can run any aws ssm command.

Run to start a new SSM terminal session:

AWS_DEFAULT_REGION=us-west-2 aws ssm start-session --target <instance-id>

starting session with SessionId: ...

sh-4.2$ ls
sh-4.2$ pwd
/opt/amazon/ssm
sh-4.2$ bash -i
[ssm-user@ip-192-168-84-111 ssm]$

[ssm-user@ip-192-168-84-111 ssm]$ exit
sh-4.2$ exit

Exiting session with sessionId: ...

The daemonset.yaml file

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ssm-agent
  labels:
    k8s-app: ssm-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ssm-agent
  template:
    metadata:
      labels:
        name: ssm-agent
    spec:
      # join host network namespace
      hostNetwork: true
      # join host process namespace
      hostPID: true
      # join host IPC namespace
      hostIPC: true 
      # tolerations
      tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      containers:
      - image: alexeiled/aws-ssm-agent:2.3.680
        imagePullPolicy: Always
        name: ssm-agent
        securityContext:
          runAsUser: 0
          privileged: true
        volumeMounts:
        # Allows systemctl to communicate with the systemd running on the host
        - name: dbus
          mountPath: /var/run/dbus
        - name: run-systemd
          mountPath: /run/systemd
        # Allows to peek into systemd units that are baked into the official EKS AMI
        - name: etc-systemd
          mountPath: /etc/systemd
        # This is needed in order to fetch logs NOT managed by journald.
        # The journald log is stored only in memory by default, so we also
        # need access to the host's /var/log.
        # If all you need is access to persistent journals, /var/log/journal/* would be enough
        # FYI, the volatile log store /var/run/journal was empty on my nodes. Perhaps it isn't used in Amazon Linux 2 / EKS AMI?
        # See https://askubuntu.com/a/1082910 for more background
        - name: var-log
          mountPath: /var/log
        - name: var-run
          mountPath: /var/run
        - name: run
          mountPath: /run
        - name: usr-lib-systemd
          mountPath: /usr/lib/systemd
        - name: etc-machine-id
          mountPath: /etc/machine-id
        - name: etc-sudoers
          mountPath: /etc/sudoers.d
      volumes:
      # for systemctl to systemd access
      - name: dbus
        hostPath:
          path: /var/run/dbus
          type: Directory
      - name: run-systemd
        hostPath:
          path: /run/systemd
          type: Directory
      - name: etc-systemd
        hostPath:
          path: /etc/systemd
          type: Directory
      - name: var-log
        hostPath:
          path: /var/log
          type: Directory
      # mainly for dockerd access via /var/run/docker.sock
      - name: var-run
        hostPath:
          path: /var/run
          type: Directory
      # var-run implies you also need this, because
      # /var/run is a symlink to /run
      # sh-4.2$ ls -lah /var/run
      # lrwxrwxrwx 1 root root 6 Nov 14 07:22 /var/run -> ../run
      - name: run
        hostPath:
          path: /run
          type: Directory
      - name: usr-lib-systemd
        hostPath:
          path: /usr/lib/systemd
          type: Directory
      # Required by journalctl to locate the current boot.
      # If omitted, journalctl is unable to locate host's current boot journal
      - name: etc-machine-id
        hostPath:
          path: /etc/machine-id
          type: File
      # Avoid this error > ERROR [MessageGatewayService] Failed to add ssm-user to sudoers file: open /etc/sudoers.d/ssm-agent-users: no such file or directory
      - name: etc-sudoers
        hostPath:
          path: /etc/sudoers.d
          type: Directory

Summary

As you can see, it's relatively easy to manage Kubernetes nodes in a pure Kubernetes way, without taking unnecessary risks or managing complex SSH infrastructure.


20 Jul 2019, 10:00

EKS GPU Cluster from Zero to Hero

Introduction

If you have ever tried to run a GPU workload on a Kubernetes cluster, you know that this task requires non-trivial configuration and comes with a high price tag (GPU instances are quite expensive).

This post shows how to run a GPU workload on a Kubernetes cluster in a cost-effective way, using an AWS EKS cluster, AWS Auto Scaling, Amazon EC2 Spot Instances and some Kubernetes resources and configurations.

Kubernetes with GPU Mixed ASG

EKS Cluster Plan

First, we need to create a Kubernetes cluster that consists of mixed nodes: non-GPU nodes for management and generic Kubernetes workloads, and more expensive GPU nodes to run GPU-intensive tasks, like machine learning, medical analysis, seismic exploration, video transcoding and others.

These node groups should be able to scale on demand (scale out and scale in) for generic nodes, and from 0 to the required number and back to 0 for expensive GPU instances. More than that, in order to do it in a cost-effective way, we are going to use Amazon EC2 Spot Instances both for generic nodes and for GPU nodes.

AWS EC2 Spot Instances

With Amazon EC2 Spot Instances you can save up to 90% compared to the On-Demand price. Previously, Spot instances were terminated in ascending order of bids, and the market prices fluctuated frequently because of this. In the current model, Spot prices are more predictable, updated less frequently, and are determined by Amazon EC2 spare capacity, not bid prices. The AWS EC2 service can reclaim Spot instances when there is not enough capacity for a specific instance type in a specific Availability Zone. Spot instances receive a 2-minute alert when they are about to be reclaimed by the Amazon EC2 service and can use this time for a graceful shutdown and state change.

The Workflow

Create EKS Cluster

It is possible to create an AWS EKS cluster using the AWS EKS CLI, CloudFormation or Terraform, AWS CDK or eksctl.

eksctl CLI tool

In this post, eksctl (a CLI tool for creating clusters on EKS) is used. It is possible to pass all parameters to the tool as CLI flags or via a configuration file. Using a configuration file makes the process more repeatable and automation friendly.

eksctl can create or update an EKS cluster and additional required AWS resources, using CloudFormation stacks.

Customize your cluster by using a config file. Just run

eksctl create cluster -f cluster.yaml

to apply a cluster.yaml file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

nodeGroups:
  - name: ng
    instanceType: m5.large
    desiredCapacity: 10

A new EKS cluster with 10 m5.large On-Demand EC2 worker nodes will be created, and the cluster credentials will be added to the ~/.kube/config file.

Creating node groups

As planned, we are going to create two node groups for Kubernetes worker nodes:

  1. General node group - an autoscaling group with Spot instances to run the Kubernetes system workload and non-GPU workloads
  2. GPU node groups - autoscaling groups with GPU-powered Spot Instances, that can scale from 0 to the required number of instances and back to 0.

Fortunately, eksctl supports adding Kubernetes node groups to an EKS cluster, and these groups can be composed of Spot-only instances or a mixture of Spot and On-Demand instances.

General node group

The eksctl configuration file below defines an EKS cluster in us-west-2 across 3 Availability Zones, and the first general-purpose autoscaling node group (from 2 to 20 nodes) running on diversified Spot instances.

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gaia-kube
  region: us-west-2

availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

nodeGroups:
  # spot workers NG - multi AZ, scale from 2
  - name: spot-ng
    ami: auto
    instanceType: mixed
    desiredCapacity: 2
    minSize: 2
    maxSize: 20
    volumeSize: 100
    volumeType: gp2
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      withAddonPolicies:
        autoScaler: true
        ebs: true
        albIngress: true
        cloudWatch: true
    instancesDistribution:
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - m4.2xlarge
        - m4.4xlarge
        - m5.2xlarge
        - m5.4xlarge
        - m5a.2xlarge
        - m5a.4xlarge
        - c4.2xlarge
        - c4.4xlarge
        - c5.2xlarge
        - c5.4xlarge
      spotInstancePools: 15
    tags:
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
    privateNetworking: true
    availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

    ### next: GPU node groups ...

Now it is time to explain some parameters used in the above configuration file.

  • ami: auto - eksctl automatically discovers the latest EKS-Optimized AMI image for worker nodes, based on the specified AWS region, EKS version and instance type. See the Amazon EKS-Optimized AMI chapter in the User Guide
  • instanceType: mixed - specifies that the actual instance type will be one of the instance types defined in the instancesDistribution section
  • iam contains a list of predefined and in-place IAM policies; eksctl creates a new IAM Role with the specified policies and attaches this role to every EKS worker node. There are several IAM policies you are required to attach to every EKS worker node; read the Amazon EKS Worker Node IAM Role section in the User Guide and the eksctl IAM policies documentation
  • instancesDistribution - specifies a mixed instance policy for EC2 Auto Scaling Groups, read the AWS MixedInstancesPolicy documentation
  • spotInstancePools - specifies the number of Spot instance pools to use, read more
  • tags - AWS tags added to EKS worker nodes
    • the Kubernetes Cluster Autoscaler uses the k8s.io/cluster-autoscaler/enabled tag for auto-discovery (see the sketch after this list)
  • privateNetworking: true - all EKS worker nodes will be placed into private subnets
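
For reference, a hedged sketch of how the Cluster Autoscaler deployment is usually pointed at ASGs carrying this tag (a fragment of the cluster-autoscaler container command; exact flags may vary between Cluster Autoscaler versions):

command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
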
Spot Instance Pools

When you are using Spot instances as worker nodes, you need to diversify usage across as many Spot Instance pools as possible. A Spot Instance pool is a set of unused EC2 instances with the same instance type (for example, m5.large), operating system, Availability Zone, and network platform.

eksctl currently supports a single Spot provisioning model: the lowestPrice allocation strategy. This strategy allows creation of a fleet of Spot Instances that is both cheap and diversified. Spot Fleet automatically deploys the cheapest combination of instance types and Availability Zones based on the current Spot price across the number of Spot pools that you specify. This combination allows avoiding the most expensive Spot Instances.

The Spot instance diversification also increases worker node availability: typically, not all Spot Instance pools will be interrupted at the same time, so only a small portion of your workload will be interrupted, and the EC2 Auto Scaling group will replace interrupted instances from other Spot Instance pools.

GPU-powered node group

The next part of our eksctl configuration file contains the first GPU autoscaling node group (from 0 to 10 nodes), running on diversified GPU-powered Spot instances.

When using GPU-powered Spot instances, it's recommended to create a GPU node group per Availability Zone and to configure the Kubernetes Cluster Autoscaler to avoid automatic ASG rebalancing.

Why is this important? GPU-powered EC2 Spot Instances have a relatively high frequency of interruption (>20% for some GPU instance types), and using multiple AZs and disabling automatic Cluster Autoscaler balancing can help to minimize GPU workload interruptions.
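
On the AWS side, one common way to stop an Auto Scaling group from rebalancing instances across AZs on its own (a hedged sketch; the ASG name is a placeholder) is to suspend its AZRebalance process:

aws autoscaling suspend-processes \
  --auto-scaling-group-name <gpu-nodegroup-asg-name> \
  --scaling-processes AZRebalance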

  # ... EKS cluster and General node group ...

  # spot GPU NG - west-2a AZ, scale from 0
  - name: gpu-spot-ng-a
    ami: auto
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 10
    volumeSize: 100
    volumeType: gp2
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      withAddonPolicies:
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
    instancesDistribution:
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p3.16xlarge
        - p2.xlarge
        - p2.8xlarge
        - p2.16xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: 'true'
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-west-2a"]

    # create additional node groups for other `us-west-2b` and `us-west-2c` availability zones ...

Now, it is time to explain some parameters used to configure the GPU-powered node group.

  • ami: auto - eksctl automatically discovers the latest EKS-Optimized AMI image with GPU support for worker nodes, based on the specified AWS region, EKS version and instance type. See the Amazon EKS-Optimized AMI with GPU support User Guide
  • iam: withAddonPolicies - if a planned workload requires access to AWS storage services, it is important to include additional IAM policies (auto-generated by eksctl)
  • tags - AWS tags added to EKS worker nodes
    • k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true - Kubernetes node taint
    • k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true' - Kubernetes node label, used by the Cluster Autoscaler to scale the ASG from/to 0
  • taints
    • nvidia.com/gpu: "true:NoSchedule" - Kubernetes GPU node taint; helps to avoid placement of non-GPU workloads on expensive GPU nodes
EKS Optimized AMI image with GPU support

In addition to the standard Amazon EKS-optimized AMI configuration, the GPU AMI includes the following:

  • NVIDIA drivers
  • The nvidia-docker2 package
  • The nvidia-container-runtime (as the default runtime)
Scaling a node group to/from 0

Starting with Kubernetes Cluster Autoscaler 0.6.1, it is possible to scale a node group to/from 0, assuming that all scale-up and scale-down conditions are met.

If you are using nodeSelector, you need to tag the ASG with a node-template key k8s.io/cluster-autoscaler/node-template/label/, and with k8s.io/cluster-autoscaler/node-template/taint/ if you are using taints.

Scheduling GPU workload

Schedule based on GPU resources

The NVIDIA device plugin for Kubernetes exposes the number of GPUs on each node of your cluster. Once the plugin is installed, it's possible to request the nvidia.com/gpu Kubernetes resource on GPU nodes and in Kubernetes workloads.
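
To verify that GPUs are actually exposed as an allocatable resource, you can inspect a GPU node (the node name is a placeholder):

kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable:"
# look for a line like: nvidia.com/gpu: 1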

Run this command to apply the NVIDIA Kubernetes device plugin as a daemonset running only on AWS GPU-powered worker nodes, using tolerations and nodeAffinity:

kubectl create -f kubernetes/nvidia-device-plugin.yaml

kubectl get daemonset -nkube-system

NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
aws-node                              5         5         5       5            5           <none>          8d
kube-proxy                            5         5         5       5            5           <none>          8d
nvidia-device-plugin-daemonset-1.12   3         3         3       3            3           <none>          8d
ssm-agent                             5         5         5       5            5           <none>          8d

using the following nvidia-device-plugin.yaml Kubernetes resource file:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-1.12
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/instance-type
                operator: In
                values:
                - p3.2xlarge
                - p3.8xlarge
                - p3.16xlarge
                - p3dn.24xlarge
                - p2.xlarge
                - p2.8xlarge
                - p2.16xlarge
Taints and Tolerations

Kubernetes taints allow a node to repel a set of pods. Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.

See Kubernetes Taints and Tolerations documentation for more details.

In order for a GPU workload to run on GPU-powered Spot instance nodes with the nvidia.com/gpu: "true:NoSchedule" taint, the workload must include both a matching toleration and a nodeSelector.

A Kubernetes deployment with 10 pod replicas and a nvidia.com/gpu: 1 limit:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  labels:
    app: cuda-vector-add
spec:
  replicas: 10
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      name: cuda-vector-add
      labels:
        app: cuda-vector-add
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
        - name: cuda-vector-add
          # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU

Deploy the cuda-vector-add deployment and see how new GPU-powered nodes are added to the EKS cluster.

# list Kubernetes nodes before running GPU workload
NAME                                            ID                                      TYPE
ip-192-168-151-104.us-west-2.compute.internal   aws:///us-west-2b/i-01d4c83eaee18b7b3   c4.4xlarge
ip-192-168-171-140.us-west-2.compute.internal   aws:///us-west-2c/i-07ec09fd128e1393f   c4.4xlarge


# deploy GPU workload on EKS cluster with tolerations for nvidia/gpu=true
kubectl create -f kubernetes/examples/vector/vector-add-dpl.yaml

# list Kubernetes nodes after several minutes to see new GPU nodes added to the cluster
kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type"

NAME                                            ID                                      TYPE
ip-192-168-101-60.us-west-2.compute.internal    aws:///us-west-2a/i-037d1994fe96eeffc   p2.16xlarge
ip-192-168-139-227.us-west-2.compute.internal   aws:///us-west-2b/i-0038eb8d2c795fb40   p2.16xlarge
ip-192-168-151-104.us-west-2.compute.internal   aws:///us-west-2b/i-01d4c83eaee18b7b3   c4.4xlarge
ip-192-168-171-140.us-west-2.compute.internal   aws:///us-west-2c/i-07ec09fd128e1393f   c4.4xlarge
ip-192-168-179-248.us-west-2.compute.internal   aws:///us-west-2c/i-0bc0853ef26c0c054   p2.16xlarge

As you can see, 3 new GPU-powered nodes (p2.16xlarge), across 3 AZs, have been added to the cluster. When you delete the GPU workload, the cluster will scale the GPU node group down to 0 after about 10 minutes.

Summary

Follow this tutorial to create an EKS (Kubernetes) cluster with a GPU-powered node group, running on Spot instances and scalable from/to 0 nodes.


Disclaimer

It does not matter where I work, all my opinions are my own.

08 Mar 2019, 10:00

Kubernetes Continuous Integration

Kubernetes configuration as Code

A complex Kubernetes application consists of multiple Kubernetes resources, defined in YAML files. Authoring properly formatted YAML files that are also valid Kubernetes specifications, and that comply with your policies, can be a challenging task.

These YAML files are your application deployment and configuration code, and should be treated as code.

As with code, a Continuous Integration approach should be applied to Kubernetes configuration files.

Git Repository

Create a separate Git repository that contains the Kubernetes configuration files. Define a Continuous Integration pipeline that is triggered automatically for every change and can validate it without human intervention.

Helm

Helm helps you manage complex Kubernetes applications. Using Helm Charts, you define, install, test and upgrade a Kubernetes application.

Here I'm going to focus on using Helm for authoring a complex Kubernetes application.

The same Kubernetes application can be installed in multiple environments: development, testing, staging and production. Helm templates help to separate the application structure from the environment configuration by keeping environment-specific values in external files.
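
For example, installing the same chart into different environments might look roughly like this (release, chart and values file names are placeholders):

# staging
helm upgrade --install myapp-staging ./mychart -f values.yaml -f staging-values.yaml

# production
helm upgrade --install myapp-prod ./mychart -f values.yaml -f production-values.yaml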

Dependency Management

Helm also helps with dependency management. A typical Kubernetes application can be composed of multiple services developed by other teams and the open source community.

A requirements.yaml file is a YAML file in which developers can declare Helm chart dependencies, along with the location of the chart and the desired version. For example, a requirements.yaml file along these lines declares two dependencies (the chart names, versions and repository URL below are purely illustrative):
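
dependencies:
  - name: postgresql
    version: ~3.1.0
    repository: https://kubernetes-charts.storage.googleapis.com
  - name: redis
    version: ~4.2.0
    repository: https://kubernetes-charts.storage.googleapis.com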

Where possible, use version ranges instead of pinning to an exact version. The suggested default is to use a patch-level version match:

version: ~1.2.3

This will match version 1.2.3 and any patches to that release. In other words, ~1.2.3 is equivalent to >= 1.2.3, < 1.3.0

YAML

YAML is the most convenient way to write Kubernetes configuration files. YAML is easier for humans to read and write than other common data formats like XML or JSON. Still, it's recommended to use automatic YAML linters to avoid syntax and formatting errors.

yamllint

Helm Chart Validation

Helm has a helm lint command that runs a series of tests to verify that the chart is well-formed. The helm lint command also converts YAML to JSON, and this way it is able to detect some YAML errors.

helm lint mychart

==> Linting mychart
[ERROR] templates/deployment.yaml: unable to parse YAML
    error converting YAML to JSON: yaml: line 53: did not find expected '-' indicator

There are a few issues with helm lint you should be aware of:

  • no real YAML validation is done: only YAML to JSON conversion errors are detected
  • it shows a wrong error line number: not the actual line in the template that contains the detected error

So, I also recommend using a YAML linter, like yamllint, to perform YAML validation.

First, you need to generate Kubernetes YAML files from a Helm chart. The helm template command renders chart templates locally and prints the output to stdout.

helm template mychart --namespace test --values dev-values.yaml

Pipe helm template and yamllint together to validate rendered YAML files.

helm template mychart | yamllint -

stdin
  41:81     error    line too long (93 > 80 characters)  (line-length)
  43:1      error    trailing spaces  (trailing-spaces)
  151:9     error    wrong indentation: expected 10 but found 8  (indentation)
  245:10    error    syntax error: expected <block end>, but found '<block sequence start>'
  293:1     error    too many blank lines (1 > 0)  (empty-lines)

Now there are multiple ways to inspect these errors:

# using vim editor
vim <(helm template mychart)

# using cat command (with line number)
cat -n <(helm template mychart)

# printing the error line and a few lines around it, replacing spaces with dots
helm template hermes | sed -n 240,250p | tr ' ' '⋅'

Valid Kubernetes Configuration

When authoring Kubernetes configuration files, it's important to check not only that they are valid YAML files, but also that they are valid Kubernetes files.

It turns out that Kubernetes supports the OpenAPI specification, and it's possible to generate a Kubernetes JSON schema for every Kubernetes API version.

Gareth Rushgrove wrote a blog post on this topic and maintains the garethr/kubernetes-json-schema GitHub repository with most recent Kubernetes and OpenShift JSON schemas for all API versions. What a great contribution to the community!

Now, with the Kubernetes API JSON schema, it's possible to validate whether any YAML file is a valid Kubernetes configuration file.

The kubeval tool (also authored by Gareth Rushgrove) is here to help.

Run kubeval piped with helm template command.

helm template mychart | kubeval

The output below shows a single detected error in a Service definition: an invalid annotation. kubeval could be more specific, by providing the Service name, but even as is, it is valuable output for detecting Kubernetes configuration errors.

The document stdin contains a valid Secret
The document stdin contains a valid Secret
The document stdin contains a valid ConfigMap
The document stdin contains a valid ConfigMap
The document stdin contains a valid PersistentVolumeClaim
The document stdin contains an invalid Service
---> metadata.annotations: Invalid type. Expected: object, given: null
The document stdin contains a valid Service
The document stdin contains a valid Deployment
The document stdin contains a valid Deployment
The document stdin is empty
The document stdin is empty
The document stdin is empty
The document stdin is empty

https://github.com/garethr/kubetest

11 Oct 2017, 18:00

Continuous Delivery and Continuous Deployment for Kubernetes microservices



Starting Point

Over the last few years we've been adopting several concepts for our project, struggling to make them work together.

The first one is the Microservice Architecture. We did not start it clean and by the book, but rather applied it to an already existing project: splitting big services into smaller ones and breaking excessive coupling. The refactoring work is not finished yet. The new services we are building start looking more like "microservices", while there are still a few that I would call "micro-monoliths". I have a feeling that this is a typical situation for an already existing project that tries to adopt this new architecture pattern: you are almost there, but there is always work to be done.

Another concept is using Docker for building, packaging and deploying application services. We bet on Docker from the very beginning and used it for most of our services, and it happened to be a good bet. There are still a few pure cloud services that we are using when running our application on a public cloud, things like databases, error analytics, push notifications and some others.

And one of the latest bets we made was Kubernetes. Kubernetes became the main runtime platform for our application. Adopting Kubernetes allowed us not only to hide away lots of operational complexity, achieving better availability and scalability, but also to run our application on any public cloud and on-premises deployment.

With the great flexibility that Kubernetes provides, it also brings additional deployment complexity. Suddenly your services are not just plain Docker containers; there are a lot of new (and useful) Kubernetes resources that you need to take care of: ConfigMaps, Secrets, Services, Deployments, StatefulSets, PVs, PVCs, Ingress, Jobs and others. And it's not always obvious where to keep all these resources and how they are related to the Docker images built by the CI tool.

“Continuous Delivery” vs. “Continuous Deployment”

The ambiguity of the CD term annoys me a lot. Different people mean different things when using this term. It's not only a question of what the abbreviation stands for, Continuous Deployment vs Continuous Delivery, but also of what people really mean when they use it.

Still, it looks like there is a common agreement that Continuous Deployment (CD) is a superset of Continuous Delivery (CD). And the main difference, so far, is that Continuous Deployment is 100% automated, while in Continuous Delivery there are still some steps that should be done manually.

In our project, for example, we have succeeded in achieving Continuous Delivery, which serves us well both for the SaaS and on-premises versions of our product. Our next goal is to create a fluent Continuous Deployment flow for the SaaS version. We would like to be able to release a change automatically to production, without human intervention, and to be able to roll back to the previous version if something goes wrong.

Kubernetes Application and Release Content

Now let's talk about a Release and try to define what Release Content is.

When we are releasing a change to some runtime environment (development, staging or production), it's not always a code change represented by a newly baked Docker image with some tag. A change can also be made to application configurations, secrets, ingress rules, the jobs we are running, volumes and other resources. It would be nice to be able to release all these changes in the same way as we release a code change. Actually, a change can be a mixture of both, and in practice it's not a rare use case.

So, we need to find a good methodology and supporting technology that will allow us to release a new version of our Kubernetes application, which might be composed of multiple changes, and these changes are not only new Docker image tags. This methodology should allow us to do it repeatedly on any runtime environment (a Kubernetes cluster in our case) and to roll back ALL changes to the previous version if something goes wrong.

That’s why we adopted Helm as our main release management tool for Kubernetes.

Helm recap

This post is not about Helm, so the Helm recap will be very short. I encourage you to read the Helm documentation; it's complete and well written.

Just to remind you, the core Helm concepts are:

  • (Helm) Chart - a package (tar archive) with Kubernetes YAML templates (for different Kubernetes resources) and default values (also stored in YAML files). Helm uses a chart to install a new or update an existing (Helm) release.
  • (Helm) Release - a Kubernetes application instance, installed with Helm. It is possible to create multiple releases from the same chart version.
  • (Release) Revision - when updating an existing release, a new revision is created. Helm can roll back a release to a previous revision. Helm stores all revisions in ConfigMaps, and it's possible to list previous revisions with the helm history command (see the example after this list).
  • Chart Repository - a location where packaged charts can be stored and shared. Any web server that can store and serve static files can be used as a Chart Repository (Nginx, GitHub, AWS S3 and others).
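
For example, inspecting and rolling back revisions of a release looks roughly like this (the release name is a placeholder):

# list all revisions of a release
helm history my-release

# roll back the release to revision 2
helm rollback my-release 2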

Helm consists of a server, called Tiller, and a command line client, called helm. When releasing a new version (or updating an existing one), the helm client sends the chart (template files and values) to the Helm server. The Tiller server generates valid Kubernetes YAML files from the templates and values and deploys them to Kubernetes, using the Kubernetes API. Tiller also saves the generated YAML files as a new revision inside ConfigMaps and can use a previously saved revision for a rollback operation.

It was a short recap. Helm is a flexible release management system and can be extended with plugins and hooks.

Helm Chart Management

A typical Helm chart contains a list of template files (YAML files with Go template commands) and values files (with configurations and secrets).

We use Git to store all our Helm chart files and Amazon S3 for chart repository.

Short How-To guide:

  1. Adopt some Git management methodology. We use something very close to the GitHub Flow model
  2. Have a git repository for each microservice. Our typical project structure:

    # chart files
    chart/
        # chart templates
        templates/
        # external dependency
        requirements.yaml
        # default values
        values.yaml
        # chart definition
        Chart.yaml
    # source code
    src/
    # test code
    test/
    # build scripts
    hack/
    # multi-stage Docker build file
    Dockerfile
    # Codefresh CI/CD pipeline
    codefresh.yaml
    
  3. We keep our application chart in a separate git repository. The application chart does not contain templates, only the list of third-party charts it needs (a requirements.yaml file) and values files for different runtime environments (testing, staging and production)
  4. All secrets in values files are encrypted with the sops tool, and we defined a .gitignore file and set up a git pre-commit hook to avoid unintentional commits of decrypted secrets.

Docker Continuous Integration

Building and testing code on a git push/tag event and packaging it into some build artifact is common knowledge, and there are tons of tools, services, and tutorials on how to do it.

Codefresh is one such service, tuned to effectively build Docker images.

Codefresh Docker CI has one significant benefit versus other similar services - besides just being a fast CI for Docker, it maintains traceability links between git commits, builds, Docker images and Helm Releases running on Kubernetes clusters.

Typical Docker CI flow

Docker CI

  1. Trigger the CI pipeline on a push event
  2. Build and test the service code. Tip: give a try to a Docker multi-stage build.
  3. Tip: embed the git commit details into the Docker image (using Docker labels). I suggest following the Label Schema convention.
  4. Tag the Docker image with {branch}-{short SHA}
  5. Push the newly created Docker image into your preferred Docker Registry (see the sketch after this list)
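
A rough shell sketch of steps 3-5 (the image name and registry are placeholders):

# tag with {branch}-{short SHA} and embed git commit details as Label Schema labels
TAG="$(git rev-parse --abbrev-ref HEAD)-$(git rev-parse --short HEAD)"
docker build \
  --label org.label-schema.vcs-ref="$(git rev-parse --short HEAD)" \
  --label org.label-schema.vcs-url="$(git config --get remote.origin.url)" \
  -t myregistry/myservice:"${TAG}" .
docker push myregistry/myservice:"${TAG}"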

Docker multistage build

With a Docker multi-stage build, you can even remove the need to learn a custom CI DSL syntax, like Jenkins Job/Pipeline or another YAML-based DSL. Just use the familiar Dockerfile imperative syntax to describe all required CI stages (build, lint, test, package) and create a thin and secure final Docker image that contains only the bare minimum required to run the service.

Using a multi-stage Docker build has other benefits.

It allows you to use the same CI flow both on the developer machine and on the CI server. It can help you to switch easily between different CI services, using the same Dockerfile. The only thing you need is the right Docker daemon version (> 17.05). So, select a CI service that supports the latest Docker daemon versions.

Example: Node.js multi-stage Dockerfile

#
# ---- Base Node ----
FROM alpine:3.5 AS base
# install node
RUN apk add --no-cache nodejs-npm tini
# set working directory
WORKDIR /root/chat
# Set tini as entrypoint
ENTRYPOINT ["/sbin/tini", "--"]
# copy project file
COPY package.json .

#
# ---- Dependencies ----
FROM base AS dependencies
# install node packages
RUN npm set progress=false && npm config set depth 0
RUN npm install --only=production
# copy production node_modules aside
RUN cp -R node_modules prod_node_modules
# install ALL node_modules, including 'devDependencies'
RUN npm install

#
# ---- Test ----
# run linters, setup and tests
FROM dependencies AS test
COPY . .
RUN npm run lint && npm run setup && npm run test

#
# ---- Release ----
FROM base AS release
# copy production node_modules
COPY --from=dependencies /root/chat/prod_node_modules ./node_modules
# copy app sources
COPY . .
# expose port and define CMD
EXPOSE 5000
CMD npm run start

Kubernetes Continuous Delivery (CD)

Building a Docker image on git push is the very first step you need to automate, but …

Docker Continuous Integration is not a Kubernetes Continuous Deployment/Delivery

After CI completes, you just have a new build artifact - a Docker image file.

Now, somehow you need to deploy it to the desired environment (Kubernetes cluster) and maybe also modify other Kubernetes resources, like configurations, secrets, volumes, policies, and others. Or maybe you do not have a "pure" microservice architecture, and some of your services still have some kind of inter-dependency and have to be released together. I know, this is not "by the book", but this is a very common use case: people are not perfect, and not all architectures out there are perfect either. Usually, you start from an already existing project and try to move it to a new, ideal architecture step by step.

So, on one side, you have one or more freshly baked Docker images. On the other side, there are one or more environments where you want to deploy these images with related configuration changes. And most likely, you would like to reduce the required manual effort to the bare minimum or dismiss it completely, if possible.

Continuous Delivery is the next step we are taking. Most of the CD tasks should be automated, while there may still be a few tasks that have to be done manually. The reasons for having manual tasks can be different: either you cannot achieve full automation, or you want to have a feeling of control (deciding when to release by pressing some "Release" button), or there is some manual effort required (bring up the new server and switch it on :) )

For our Kubernetes Continuous Delivery pipeline, we manually update the Codefresh application Helm chart with the appropriate image tags, and sometimes we also update different Kubernetes YAML template files too (defining a new PVC or environment variable). Once changes to our application chart are pushed into the git repository, an automated Continuous Delivery pipeline execution is triggered.

Codefresh includes some helper steps that make building a Kubernetes CD pipeline easier. First, we have a built-in helm update step that can install or update a Helm chart on a specified Kubernetes cluster or namespace, using a Kubernetes context defined in the Codefresh account.

Codefresh also provides a nice view of what is running in your Kubernetes cluster, where it comes from (release, build) and what it contains: images, image metadata (quality, security, etc.), code commits.

We use our own service (Codefresh) to build an effective Kubernetes Continuous Delivery pipeline for deploying Codefresh itself. We also constantly add new features and useful functionality that simplify our life (as developers) and hopefully help our customers too.

Codefresh Helm Release View

Typical Kubernetes Continuous Delivery flow

Kubernetes Continuous Delivery

  1. Set up a Docker CI for the application microservices
  2. Update the microservice code and chart template files, if needed (adding ports, env variables, volumes, etc.)
  3. Wait till Docker CI completes and you have a new Docker image for the updated microservice(s)
  4. Manage the application Helm chart code in a separate git repository; use the same git branch methodology as for microservices
  5. Manually update the imageTags for the updated microservice(s)
  6. Manually update the application Helm chart version
  7. Trigger the CD pipeline on a git push event for the application Helm chart git repository
    • validate the Helm chart syntax: use helm lint
    • convert the Helm chart to Kubernetes template files (with the helm template plugin) and use kubeval to validate these files
    • package the application Helm chart and push it to the Helm chart repository (see the sketch after this list)
      • Tip: create a few chart repositories; I suggest having a chart repository per environment: production, staging, develop
  8. Manually (or automatically) execute helm upgrade --install from the corresponding chart repository
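
A minimal shell sketch of the validation and packaging steps above (the chart and repository names are placeholders; pushing to S3 assumes the helm-s3 plugin):

# validate chart syntax and the rendered Kubernetes manifests
helm lint mychart
helm template mychart | kubeval

# package the chart and push it to an S3-backed chart repository
helm package mychart
helm s3 push mychart-0.1.0.tgz my-chart-repo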

After CD completes, we have a new artifact - an updated Helm chart package (tar archive) of our Kubernetes application, with a new version number.

Now, we can run the helm upgrade --install command, creating a new revision for the application release. If something goes wrong, we can always roll back the failed release to the previous revision. For the sake of safety, I suggest first running helm diff (using the helm diff plugin), or at least using the --dry-run flag for the first run, and inspecting the difference between the new release version and the already installed revision. If you are ok with the upcoming changes, accept them and run the helm upgrade --install command without the --dry-run flag.
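
A hedged sketch of that flow (release, chart and values file names are placeholders; helm diff requires the helm-diff plugin):

# preview changes against the currently installed revision
helm diff upgrade my-release ./my-app-chart -f production-values.yaml

# or do a dry run first
helm upgrade --install my-release ./my-app-chart -f production-values.yaml --dry-run --debug

# apply the changes for real
helm upgrade --install my-release ./my-app-chart -f production-values.yaml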

Kubernetes Continuous Deployment (CD)

Based on the above definition, to achieve Continuous Deployment we should try to avoid all manual steps besides git push for code and configuration changes. All actions running after git push should be 100% automated and deliver all changes to the corresponding runtime environment.

Let's take a look at the manual steps from the "Continuous Delivery" pipeline and think about how we can automate them.

Kubernetes Continuous Deployment

Automate: Update microservice imageTag after successful docker push

After a new Docker image for some microservice is pushed to a Docker Registry, we would like to update the microservice Helm chart with the new Docker image tag. There are (at least) two options to do this.

  1. Add a Docker Registry WebHook handler (for example, using AWS Lambda). Take the new image tag from the DockerHub push event payload and update the corresponding imageTag in the application Helm chart. For GitHub, we can use the GitHub API to update a single file, or bash scripting with a mixture of sed and git commands (see the sketch after this list).
  2. Add an additional step to every microservice CI pipeline, after the docker push step, to update the corresponding imageTag for the microservice Helm chart
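
A rough sketch of the "sed + git" approach from option 1 (the values file layout, key name and branch are hypothetical):

# bump the image tag in the application chart and push the change
NEW_TAG="master-abc1234"
sed -i "s|^  imageTag:.*|  imageTag: ${NEW_TAG}|" charts/my-service/values.yaml
git commit -am "bump my-service image to ${NEW_TAG}"
git push origin master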

Automate: Deploy Application Helm chart

After a new chart version is uploaded to the chart repository, we would like to deploy it automatically to the "linked" runtime environment and roll back on failure.

A Helm chart repository is not a real server that is aware of deployed charts. It is possible to use any web server that can serve static files as a Helm chart repository. In general, I like simplicity, but sometimes it leads to naive design and a lack of basic functionality, and with the Helm chart repository this is the case. Therefore, I recommend using a web server that exposes a nice API and allows you to get notifications about content changes without a pull loop. Amazon S3 can be a good choice for a Helm chart repository.

Once you have a chart repository up and running and can get notifications about a content update (as a WebHook or with a poll loop), you can take the next steps towards Kubernetes Continuous Deployment:

  1. Get updates from the Helm chart repository: a new chart version
  2. Run the helm upgrade --install command to update/install a new application version on the "linked" runtime environment
  3. Run post-install and in-cluster integration tests
  4. Roll back to the previous application revision on any "failure"

Summary

This post describes the current Kubernetes Continuous Delivery pipeline we have succeeded in setting up. There are still things we need to improve and change in order to achieve fully automated Continuous Deployment.

We constantly change Codefresh to be the product that helps us and our customers to build and maintain effective Kubernetes CD pipelines. Give it a try and let us know how we can improve it.


Hope you find this post useful. I look forward to your comments and any questions you have.


*This is a working draft version. The final post version is published at Codefresh Blog on December 4, 2018.*

04 Oct 2017, 18:00

Chaos Testing for Docker Containers

What follows is the text of my presentation, Chaos Testing for Docker Containers, that I gave at ContainerCamp in London this year. I've also decided to turn the presentation into an article. I edited the text slightly for readability and added some links for more context. You can find the original video recording and slides at the end of this post.

Docker Chaos Testing

Intro

Software development is about building software services that support business needs. The more complex the business processes we want to automate and integrate with, the more complex the software systems we are building. And solution complexity tends to grow over time and scope.

The reasons for growing complexity can be different. Some systems just tend to handle too many concerns, or require a lot of integrations with external services and internal legacy systems. These systems are written and rewritten multiple times over several years by different people with different skills, trying to satisfy constantly changing business requirements, using different technologies, following different technology and architecture trends.

So, my point is that building software that unintentionally becomes more and more complex over time is easy - we have all done it in the past or are doing it right now. Building a "good" software architecture for complex systems and preserving its "good" abilities for some period of time is really hard.

When you have too many "moving" parts, integrations, constantly changing requirements that lead to code changes, security upgrades, hardware modernization, multiple network communication channels and so on, it can become a "Mission Impossible" to avoid unexpected failures.

Stuff happens!

All systems fail from time to time. And your software system will fail too. Take this as a fact of life. There will always be something that can — and will — go wrong. No matter how hard we try, we can't build perfect software, nor can the companies we depend on. Even the most stable and respectable services, from companies that practice CI/CD and test-driven development (TDD/BDD), have huge QA departments and well-defined release procedures, fail.

Just a few examples from last year's outages:

  1. IBM, January 26
    • IBM’s cloud credibility took a hit at the start of the year when a management portal used by customers to access its Bluemix cloud infrastructure went down for several hours. While no underlying infrastructure actually failed, users were frustrated in finding they couldn’t manage their applications or add or remove cloud resources powering workloads.
    • IBM said the problem was intermittent and stemmed from a botched update to the interface.
  2. GitLab, January 31
    • GitLab's popular online code repository, GitLab.com, suffered an 18-hour service outage that ultimately couldn't be fully remediated.
    • The problem resulted when an employee removed a database directory from the wrong database server during maintenance procedures.
  3. AWS, February 28
    • This was the outage that shook the industry.
    • An Amazon Web Services engineer trying to debug an S3 storage system in the provider’s Virginia data center accidentally typed a command incorrectly, and much of the Internet – including many enterprise platforms like Slack, Quora and Trello – was down for four hours.
  4. Microsoft Azure, March 16
    • Storage availability issues plagued Microsoft’s Azure public cloud for more than eight hours, mostly affecting customers in the Eastern U.S.
    • Some users had trouble provisioning new storage or accessing existing resources in the region. A Microsoft engineering team later identified the culprit as a storage cluster that lost power and became unavailable.

Visit Outage.Report or Downdetector to see a constantly updating long list of outages reported by end-users.

Chasing Software Quality

As software engineers, we want to be proud of the software systems we are building. We want these systems to be of high quality, without functional bugs or security holes, providing exceptional performance, resilient to unexpected failures, self-healing, always available and easy to maintain and modernize.

Every new project starts with a "high quality" picture in mind, and no one wants to create crappy software, but very few of us (or none) are able to achieve and keep intact all the good "abilities". So, what can we do to improve overall system quality? Should we do more testing?

I tend to say “Yes” - software testing is critical. But just running unit, functional and performance testing is not enough.

Today, building complex distributed systems is much easier with all the new amazing technology we have and the experience we have gathered. Microservice Architecture is a real trend nowadays, and miscellaneous container technologies support this architecture. It's much easier to deploy, scale, link, monitor, update and manage distributed systems composed of multiple "microservices". When we are building distributed systems, we are choosing P (Partition Tolerance) from the CAP theorem and, second to it, either A (Availability - the most popular choice) or C (Consistency). So, we need to find a good approach for testing AP or CP systems.

Traditional testing disciplines and tools do not provide a good answer to the question: how does your distributed system behave when unexpected stuff happens in production? Sure, you can learn from previous failures, after the fact, and you should definitely do it. But learning from past experience should not be the only way to prepare for future failures.

Waiting for things to break in production is not an option. But what’s the alternative?

Chaos Engineering

The alternative is to break things on purpose. And Chaos Engineering is a particular approach to doing just that. The idea of Chaos Engineering is to embrace the failure! Chaos Engineering for distributed software systems was originally popularized by Netflix.

Chaos Engineering defines an empirical approach to resilience testing of distributed software systems. You are testing a system by conducting chaos experiments.

Typical chaos experiment:

  • define a normal/steady state of the system (e.g. by monitoring a set of system and business metrics)
  • pseudo-randomly inject faults (e.g. by terminating VMs, killing containers or changing network behavior)
  • try to discover system weaknesses by deviation from expected or steady-state behavior

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.  
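
As a minimal illustration, here is what such an experiment could look like with plain Docker commands. The webapp container name and the /health endpoint are made up for this sketch; substitute whatever your system exposes.

# 1. verify the steady state (the service answers on its health endpoint)
curl -fs http://localhost:8080/health

# 2. pseudo-randomly inject a fault - here, abruptly kill the container
docker kill webapp

# 3. after a while, check whether the system recovered (restart policy, orchestrator, replica, etc.)
sleep 10 && curl -fs http://localhost:8080/health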

Chaos Engineering tools

Of course, it’s possible to practice Chaos Engineering manually, or to rely on automatic system updates, but we, as engineers, like to automate boring manual tasks, so there are some nice tools to use.

Netflix built some useful tools for practicing Chaos Engineering in the public cloud (AWS):

  • Chaos Monkey - kill EC2 instances, kill processes, burn CPU, fill disk, detach volumes, add network latency, etc.
  • Chaos Kong - remove whole AWS Regions

These are very good tools, and I encourage you to use them. But when I started my new container-based project (2 years ago), these tools provided the wrong granularity for the chaos I wanted to create, and I wanted to be able to create that chaos not only in a real cluster, but also on a single developer machine, so I could debug and tune my application. I searched Google for a Chaos Monkey for Docker, but did not find anything besides some basic Bash scripts. So, I decided to create my own tool. And since it turned out to be quite useful from the very first version, I shared it with the community as open source. It’s a Chaos Monkey Warthog for Docker - Pumba

Pumba - Chaos Testing for Docker

What is Pumba(a)?

Those of us who have kids, or were kids in the 90s, should remember this character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”. I like the Swahili meaning of this word. It was a perfect match for the tool I wanted to create.

What can Pumba do?

Pumba disturbs a running Docker runtime environment by injecting different failures. Pumba can kill, stop, remove, or pause Docker containers. Pumba can also perform network emulation, simulating different network failures, such as delay, packet loss (using different probability loss models), bandwidth rate limits, and more. For network emulation, Pumba uses the Linux kernel traffic control tc with the netem queueing discipline (read more here). If tc is not available within the target container, Pumba uses a sidekick container with tc on board, attaching it to the target container's network.

You can pass a list of containers to Pumba or just write a regular expression to select matching containers. If you do not specify any containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list. It’s also possible to define a repeatable time interval and duration parameters to better control the amount of chaos you want to create.

Pumba is available as a single binary file for Linux, MacOS and Windows, or as a Docker container.

# Download binary from https://github.com/gaia-adm/pumba/releases
curl -L https://github.com/gaia-adm/pumba/releases/download/0.4.6/pumba_linux_amd64 --output /usr/local/bin/pumba
chmod +x /usr/local/bin/pumba && pumba --help

# Install with Homebrew (MacOS only)
brew install pumba && pumba --help

# Use Docker image
docker run gaiaadm/pumba pumba --help

Pumba commands examples

First of all, run pumba --help to get help about available commands and options and pumba <command> --help to get help for the specific command and sub-command.

# pumba help
pumba --help

# pumba kill help
pumba kill --help

# pumba netem delay help
pumba netem delay --help

Killing a randomly chosen Docker container from the ^test regex list.

# on main pane/screen, run 8 test containers that do nothing
for i in {0..7}; do docker run -d --rm --name test$i alpine tail -f /dev/null; done
# run an additional container with 'skipme' name
docker run -d --rm --name skipme alpine tail -f /dev/null

# run this command in another pane/screen to see running docker containers
watch docker ps -a

# go back to main pane/screen and kill (once in 10s) random 'test' container, ignoring 'skipme'
pumba --random --interval 10s kill re2:^test
# press Ctrl-C to stop Pumba at any time

Adding a 3000ms (±50ms) delay to the egress traffic of the ping container for 20 seconds, using a normal distribution model.

# run "ping" container on one screen/pane
docker run -it --rm --name ping alpine ping 8.8.8.8

# on second screen/pane, run pumba netem delay command, disturbing "ping" container; sidekick a "tc" helper container
pumba netem --duration 20s --tc-image gaiadocker/iproute2 delay --time 3000 --jitter 50 --distribution normal ping
# pumba will exit after 20s, or stop it with Ctrl-C

To demonstrate the packet loss capability, we will need three screens/panes. I will use the iperf network bandwidth measurement tool. On the first pane, run a server Docker container with iperf on board and start a UDP server there. On the second pane, start a client Docker container with iperf and send datagrams to the server container. Then, on the third pane, run the pumba netem loss command, adding packet loss to the client container. Enjoy the chaos.

# create docker network
docker network create -d bridge testnet

# > Server Pane
# run server container
docker run -it --name server --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
# shell inside server container: run a UDP Server listening on UDP port 5001
sh$ iperf -s -u -i 1

# > Client Pane
# run client container
docker run -it --name client --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
# shell inside client container: send datagrams to the server -> see no packet loss
sh$ iperf -c server -u

# > Server Pane
# see server receives datagrams without any packet loss

# > Pumba Pane
# inject 20% packet loss into client container, for 1m
pumba netem --duration 1m --tc-image gaiadocker/iproute2 loss --percent 20 client

# > Client Pane
# shell inside client container: send datagrams to the server -> see ~20% packet loss
sh$ iperf -c server -u

Session and slides

Slides


Hope, you find this post useful. I look forward to your comments and any questions you have.

03 Jun 2017, 18:00

Debugging remote Node.js application running in a Docker container

Teaser

Suppose you want to debug a Node.js application that is already running on a remote machine inside a Docker container, and you would like to do it without modifying the command arguments (enabling debug mode) or opening the remote Node.js debugger agent port to the whole world.

I bet you didn’t know that it’s possible and also have no idea how to do it.

I encourage you to continue reading this post if you are eager to learn some new cool stuff.

The TodoMVC demo application

I’m using the fork of TodoMVC Node.js application (by Gleb Bahmutov) as a demo application for this blog post. Feel free to clone and play with this repository.

Here is the Dockerfile I’ve added for the TodoMVC application. It allows running the TodoMVC application inside a Docker container.

FROM alpine:3.5

# install node
RUN apk add --no-cache nodejs-current tini

RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

# Build time argument to set NODE_ENV ('production' by default)
ARG NODE_ENV
ENV NODE_ENV ${NODE_ENV:-production}

# install npm packages: clean obsolete files
COPY package.json /usr/src/app/
RUN npm config set depth 0 && \
    npm install && \
    npm cache clean && \
    rm -rf /tmp/*

# copy source files
COPY . /usr/src/app

EXPOSE 3000

# Set tini as entrypoint
ENTRYPOINT ["/sbin/tini", "--"]

CMD [ "npm", "start" ]

# add VCS labels for code sync and nice reports
ARG VCS_REF="local"
LABEL org.label-schema.vcs-ref=$VCS_REF \          
      org.label-schema.vcs-url="https://github.com/alexei-led/todomvc-express.git"

Building and Running TodoMVC in a Docker container:

To build a new Docker image for TodoMVC application, run the docker build command.

$ # build Docker image; set VCS_REF to current HEAD commit (short)
$ docker build -t local/todomvc --build-arg VCS_REF=`git rev-parse --short HEAD` .
$ # run TodoMVC in a Docker container
$ docker run -d -p 3000:3000 --name todomvc local/todomvc node src/start.js

The Plan

Final goal: I would like to be able to attach a Node.js debugger to a Node.js application that is already up and running inside a Docker container on a remote host machine in the AWS cloud, without modifying the application, the container, or the container configuration, and without restarting it with additional debug flags. Imagine that the application is running and there is some problem happening right now - I want to connect to it with a debugger and start looking at the problem.

So, I need a plan - a step by step flow that will help me to achieve the final goal.

Let’s start with exploring the inventory. On the server (AWS EC2 VM) machine, I have a Node.js application running inside a Docker container. On the client (my laptop), I have an IDE (Visual Studio Code, in my case), Node.js application code (git pull/clone) and a Node.js debugger.

So, here is my plan:

  1. Set already running application to debug mode
  2. Expose a new Node.js debugger agent port to enable remote debugging in a secure way
  3. Synchronize client and server code: both should be on the same commit in the git tree
  4. Attach a local Node.js debugger to the Node.js debugger agent port on remote server and do it in a secure way
  5. And, if everything works, I should be able to perform regular debugging tasks, like setting breakpoints, inspecting variables, pausing execution and others.

Debug Node in Docker

Step 1: set already running Node.js application to the debug mode

The V8 debugger can be enabled and accessed either by starting Node with the --debug command-line flag or by signaling an existing Node process with SIGUSR1. (Node API documentation)

Cool! So, in order to switch on Node debugger agent, I just need to send the SIGUSR1 signal to the Node.js process of TodoMVC application. Remember, it’s running inside a Docker container. What command can I use to send process signals to an application running in a Docker container?

The docker kill command is my choice! This command does not necessarily “kill” the PID 1 process running in a Docker container; rather, it sends a Unix signal to it (by default it sends SIGKILL).

Setting TodoMVC into debug mode

So, all I need to do is to send SIGUSR1 to my TodoMVC application running inside todomvc Docker container.

There are two ways to do this:

  1. Use the docker kill --signal command to send SIGUSR1 to the PID 1 process running inside the Docker container; if it’s a “proper” init application (like tini) that forwards signals correctly, this will work
  2. Or execute kill -s SIGUSR1 inside the already running Docker container, sending the SIGUSR1 signal to the main Node.js process.

$ # send SIGUSR1 with docker kill (if using proper init process)
$ docker kill --signal SIGUSR1 todomvc 
$ # OR run kill command for node process inside todomvc container
$ docker exec -it todomvc sh -c 'kill -s SIGUSR1 $(pidof -s node)'

Let’s verify that Node application is set into debug mode.

$ docker logs todomvc

TodoMVC server listening at http://:::3000
emitting 2 todos
server has new 2 todos
GET / 200 31.439 ms - 3241
GET /app.css 304 4.907 ms - -
Starting debugger agent.
Debugger listening on 127.0.0.1:5858

As you can see, the Node.js debugger agent was started, but it can accept connections only from localhost - see the last output line: Debugger listening on 127.0.0.1:5858
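
To double-check this, you can also list the listening sockets inside the container. This is a small sketch that assumes the busybox netstat applet is available in the Alpine-based image:

$ # list listening TCP sockets inside the todomvc container
$ docker exec todomvc netstat -tln
$ # expect to see 127.0.0.1:5858 (debugger agent) and :::3000 (the app itself)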

Step 2: expose Node debug port

In order to attach a remote Node.js debugger to a Node application, running in the debug mode, I need:

  1. Allow connection to debugger agent from any (or specific) IP (or IP range)
  2. Open port of Node.js debugger agent outside of Docker container

How can you do this when the application is already running in a Docker container, the Node.js debugger agent is ready to talk only to a debugger running on the same machine, and the debugger agent port is not accessible from outside the Docker container?

Of course, it’s possible to start every Node.js Docker container with an exposed debugger port and allow connections from any IP (using the --debug-port and --debug Node.js flags), but we are not looking for easy ways :).

It’s not a good idea from a security point of view (allowing unprotected access to a Node.js debugger). Also, if I restart an already running application with debug flags, I’m going to lose the current execution context and may not be able to reproduce the problem I wanted to debug.

I need a better solution!

Unfortunately, Docker does not allow exposing an additional port of an already running Docker container. So, I somehow need to connect to the running container's network and expose a new port for the Node.js debugger agent.

Also, it is not possible to tell a Node.js debugger agent to accept connections from different IP addresses once the Node.js process has already been started.

Both of the above problems can be solved with the help of a small Linux utility called socat (SOcket CAT). It is just like netcat, but with security in mind (e.g., it supports chrooting) and works over various protocols and through files, pipes, devices, TCP sockets, Unix sockets, a client for SOCKS4, proxy CONNECT, SSL, etc.

From the socat man page:

> socat is a command line based utility that establishes two bidirectional byte streams and transfers data between them. Because the streams can be constructed from a large set of different types of data sinks and sources (see address types), and because lots of address options may be applied to the streams, socat can be used for many different purposes.

Exactly, what I need!

So, here is the plan. I will run a new Docker container with the socat utility onboard, and configure Node.js debugger port forwarding for TodoMVC container.

socat.Dockerfile:

FROM alpine:3.5
RUN apk add --no-cache socat
CMD socat -h

Building socat Docker container

$ docker build -t local/socat - < socat.Dockerfile

Allow connection to Node debugger agent from any IP

I need to run a “sidecar” socat container in the same network namespace as the todomvc container and define a port forwarding.

$ # define local port forwarding
$ docker run -d --name socat-nid --network=container:todomvc local/socat socat TCP-LISTEN:4848,fork TCP:127.0.0.1:5858

Now any traffic that arrives at port 4848 will be routed to the Node.js debugger agent listening on 127.0.0.1:5858. Port 4848 can accept traffic from any IP. It’s possible to use an IP range to restrict connections to the socat listening port by adding the range=<ANY IP RANGE> option.
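
For example, here is a sketch of the same sidecar started with a range restriction instead of the open listener above; the 203.0.113.0/24 CIDR is purely illustrative, substitute your own network.

$ # only accept connections to port 4848 from the 203.0.113.0/24 range (illustrative CIDR)
$ docker run -d --name socat-nid --network=container:todomvc local/socat socat TCP-LISTEN:4848,fork,range=203.0.113.0/24 TCP:127.0.0.1:5858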

Exposing Node.js debugger port from Docker container

First, we will get IP of todomvc Docker container.

$ # get IP of todomvc container
$ TODOMVC_IP=$(docker inspect -f "{{.NetworkSettings.IPAddress}}" todomvc)

Then, configure port forwarding to the “sidecar” socat port we defined previously, running on the same network as the todomvc container.

$ # run socat container to expose Node.js debugger agent port forwarder
$ docker run -d -p 5858:5858 --name socat local/socat socat TCP-LISTEN:5858,fork TCP:${TODOMVC_IP}:4848

Any traffic that arrives at port 5858 on the Docker host will be forwarded first to the 4848 socat port and then to the Node.js debugger agent running inside the todomvc Docker container.

Exposing Node.js debugger port for remote access

In most cases, I would like to debug an application running on a remote machine (AWS EC2 instance, for example). I also do not want to expose a Node.js debugger agent port unprotected to the whole world.

One possible and working solution is to use SSH tunneling to access this port.

$ # Open SSH Tunnel to gain access to servers port 5858. Set `SSH_KEY_FILE` to ssh key location or add it to ssh-agent
$ #
$ # open an ssh tunnel, send it to the bg, and wait 20 seconds for connections
$ # once all connections are closed after 20 seconds then close the tunnel
$ ssh -i ${SSH_KEY_FILE} -f -o ExitOnForwardFailure=yes -L 5858:127.0.0.1:5858 ec2_user@some.ec2.host.com sleep 20

Now all traffic to localhost:5858 will be tunneled over SSH to the remote Docker host machine and, after some socat forwarding, to the Node.js debugger agent running inside the todomvc container.

Step 3: Synchronizing on the same code commit

In order to be able to debug a remote application, you need to make sure that you are using the same code in your IDE as the one that is running on the remote server.

I will try to automate this step too. Remember the LABEL command I’ve used in the TodoMVC Dockerfile?

These labels help me to identify git repository and commit for the application Docker image:

  1. org.label-schema.vcs-ref - contains short SHA for a HEAD commit
  2. org.label-schema.vcs-url - contains an application git repository url (I can use in clone/pull)

I’m using the Label Schema convention (http://label-schema.org/rc1/), since I really like it and find it useful, but you can select any other convention too.

This approach allows me, for each properly labeled Docker image, to identify the application code repository and the commit it was created from.

$ # get git repository url form Docker image
$ GIT_URL=$(docker inspect local/todomvc | jq -r '.[].ContainerConfig.Labels."org.label-schema.vcs-url"')
$ # get git commit from Docker image
$ GIT_COMMIT=$(docker inspect local/todomvc | jq -r '.[].ContainerConfig.Labels."org.label-schema.vcs-ref"')
$ 
$ # clone git repository, if needed
$ git clone $GIT_URL
$ # set HEAD to same commit as server
$ git checkout $GIT_COMMIT

Now, both my local development environment and remote application are on the same git commit. And I can start to debug my code, finally!

Step 4: Attaching local Node.js debugger to debugger agent port

To start debugging, I need to configure my IDE. In my case, it’s Visual Studio Code and I need to add a new Launch configuration.

This launch configuration specifies the remote debugger server and port to attach to, and the remote location of the application source files, which should be in sync with the local files (see the previous step).

{
    // For more information about Node.js debug attributes, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "type": "node",
            "request": "attach",
            "name": "Debug Remote Docker",
            "address": "127.0.0.1",
            "port": 5858,
            "localRoot": "${workspaceRoot}/",
            "remoteRoot": "/usr/src/app/"
        }
    ]
}

Summary

And finally, I’ve met my goal: I’m able to attach a Node.js debugger to a Node.js application, that is already up and running in a Docker container on a remote machine.

It was a long journey to find the proper solution, but now that I have found it, the process does not look complex at all. Now, whenever I encounter a new problem in our environment, I can easily attach the Node.js debugger to the running application and start exploring the problem. Nice, isn’t it?

I’ve recorded a short movie, just to demonstrate all steps and prove that things are working fluently, exactly as I’ve described in this post.


Hope, you find this post useful. I look forward to your comments and any questions you have.

25 Apr 2017, 18:00

Create lean Node.js image with Docker multi-stage build

TL;DR

Starting with Docker 17.05, you can create a single Dockerfile that builds multiple helper images with compilers, tools, and tests, and uses files from those images to produce the final Docker image.

Multi-stage Docker Build

The “core principle” of Dockerfile

Docker can build images by reading the instructions from a Dockerfile. A Dockerfile is a text file that contains a list of all the commands needed to build a new Docker image. The syntax of Dockerfile is pretty simple and the Docker team tries to keep it intact between Docker engine releases.

The core principle is very simple: 1 Dockerfile -> 1 Docker Image.

This principle works just fine for basic use cases, where you just need to demonstrate Docker capabilities or put some “static” content into a Docker image.

Once you advance with Docker and would like to create secure and lean Docker images, a single Dockerfile is not enough.

People who insist on following the above principle find themselves with slow Docker builds, huge Docker images (several GB in size), slow deployment times, and lots of CVE violations embedded into these images.

The Docker Build Container pattern

Docker Pattern: The Build Container

The basic idea behind Build Container pattern is simple:

Create additional Docker images with required tools (compilers, linters, testing tools) and use these images to produce lean, secure and production ready Docker image.


An example of the Build Container pattern for typical Node.js application:

  1. Derive FROM a Node base image (for example node:6.10-alpine) with node and npm installed (Dockerfile.build)
  2. Add package.json
  3. Install all node modules from dependency and devDependency
  4. Copy application code
  5. Run compilers, code coverage, linters, code analysis and testing tools
  6. Create the production Docker image; derive FROM same or other Node base image
  7. install node modules required for runtime (npm install --only=production)
  8. expose PORT and define default CMD (command to run your application)
  9. Push the production image to some Docker registry

This flow assumes that you are using two or more separate Dockerfiles and a shell script or flow tool to orchestrate all steps above.
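
A minimal sketch of such an orchestration script, assuming the builder Dockerfile is named Dockerfile.build and using illustrative image names, could look like this:

$ # build the builder image from Dockerfile.build
$ docker build -t local/chat-builder -f Dockerfile.build .
$ # run linter, setup and tests (the builder image's default CMD)
$ docker run --rm local/chat-builder
$ # if the tests pass, build the lean production image from the second Dockerfile
$ docker build -t local/chat .
$ # finally, tag and push the production image to your Docker registry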

Example

I use a fork of Let’s Chat node.js application.

Builder Docker image with eslint, mocha and gulp

FROM alpine:3.5
# install node 
RUN apk add --no-cache nodejs
# set working directory
WORKDIR /root/chat
# copy project file
COPY package.json .
# install node packages
RUN npm set progress=false && \
    npm config set depth 0 && \
    npm install
# copy app files
COPY . .
# run linter, setup and tests
CMD npm run lint && npm run setup && npm run test

Production Docker image with ‘production’ node modules only

FROM alpine:3.5
# install node
RUN apk add --no-cache nodejs tini
# set working directory
WORKDIR /root/chat
# copy project file
COPY package.json .
# install node packages
RUN npm set progress=false && \
    npm config set depth 0 && \
    npm install --only=production && \
    npm cache clean
# copy app files
COPY . .
# Set tini as entrypoint
ENTRYPOINT ["/sbin/tini", "--"]
# application server port
EXPOSE 5000
# default run command
CMD npm run start

What is Docker multi-stage build?

Docker 17.05 extends the Dockerfile syntax to support the new multi-stage build by extending two commands: FROM and COPY.

The multi-stage build allows using multiple FROM commands in the same Dockerfile. The last FROM command produces the final Docker image; all other images are intermediate images (no final Docker image is produced for them, but all layers are cached).

The FROM syntax also supports the AS keyword. Use the AS keyword to give the current image a logical name and reference it later by this name.

To copy files from intermediate images, use COPY --from=<image_AS_name|image_number>, where the number starts from 0 (but it’s better to use a logical name via the AS keyword).

Creating a multi-stage Dockerfile for Node.js application

The Dockerfile below makes the Build Container pattern obsolete, allowing you to achieve the same result with a single file.

#
# ---- Base Node ----
FROM alpine:3.5 AS base
# install node
RUN apk add --no-cache nodejs-npm tini
# set working directory
WORKDIR /root/chat
# Set tini as entrypoint
ENTRYPOINT ["/sbin/tini", "--"]
# copy project file
COPY package.json .

#
# ---- Dependencies ----
FROM base AS dependencies
# install node packages
RUN npm set progress=false && npm config set depth 0
RUN npm install --only=production 
# copy production node_modules aside
RUN cp -R node_modules prod_node_modules
# install ALL node_modules, including 'devDependencies'
RUN npm install

#
# ---- Test ----
# run linters, setup and tests
FROM dependencies AS test
COPY . .
RUN  npm run lint && npm run setup && npm run test

#
# ---- Release ----
FROM base AS release
# copy production node_modules
COPY --from=dependencies /root/chat/prod_node_modules ./node_modules
# copy app sources
COPY . .
# expose port and define CMD
EXPOSE 5000
CMD npm run start

The above Dockerfile creates 3 intermediate Docker images and a single release Docker image (the final FROM).

  1. The first image, FROM alpine:3.5 AS base - a base Node image with node, npm, tini (the init app) and package.json
  2. The second image, FROM base AS dependencies - contains all node modules from dependencies and devDependencies, with an additional copy of the production-only dependencies set aside for the final image
  3. The third image, FROM dependencies AS test - runs linters, setup and tests (with mocha); if this run command fails, no final image is produced
  4. The final image, FROM base AS release - a base Node image with the application code and all node modules from dependencies

Try Docker multi-stage build today

In order to try the Docker multi-stage build, you need to get Docker 17.05, which is going to be released in May and is currently available on the beta channel.

So, you have two options:

  1. Use the beta channel to get Docker 17.05
  2. Run dind container (docker-in-docker)

Running Docker-in-Docker 17.05 (beta)

Running Docker 17.05 (beta) in a Docker container (--privileged is required):

$ docker run -d --rm --privileged -p 23751:2375 --name dind docker:17.05.0-ce-dind --storage-driver overlay2

Try a multi-stage build. Add --host=:23751 to every Docker command, or set the DOCKER_HOST environment variable.

$ # using --host
$ docker --host=:23751 build -t local/chat:multi-stage .

$ # OR: setting DOCKER_HOST
$ export DOCKER_HOST=localhost:23751
$ docker build -t local/chat:multi-stage .

Summary

With Docker multi-stage build feature, it’s possible to implement an advanced Docker image build pipeline using a single Dockerfile. Kudos to Docker team!


Hope, you find this post useful. I look forward to your comments and any questions you have.

07 Mar 2017, 18:00

Crafting perfect Java Docker build flow

TL;DR

What is the bare minimum you need to build, test, and run your Java application in a Docker container?

The recipe: Create a separate Docker image for each step and optimize the way you are running it.

Duke and Container

Introduction

I started working with Java in 1998, and for a long time, it was my main programming language. It was a long love–hate relationship.

During my career, I wrote a lot of code in Java. Despite that fact, I don’t think Java is usually the right choice for writing microservices running in Docker containers.

But sometimes you have to work with Java. Maybe Java is your favorite language and you do not want to learn a new one, or you have legacy code that you need to maintain, or your company decided on Java and you have no other option.

Whatever reason you have to marry Java with Docker, you better do it properly.

In this post, I will show you how to create an effective Java-Docker build pipeline to consistently produce small, efficient, and secure Docker images.

Be careful

There are plenty of “Docker for Java developers” tutorials out there, that unintentionally encourage some Docker bad practices.

For example:

For the current demo project, the first two tutorials took around 15 minutes to build (first build) and produced images of 1.3GB each.

Do yourself a favor and do not follow these tutorials!

What should you know about Docker?

Developers new to Docker are often tempted to think of it as just another VM. Instead, think of Docker as a “child process”. The files and packages needed for an entire VM are different from those needed by just another process running on a dev machine. Docker is even better than a child process because it allows better isolation and environmental control.

If you’re new to Docker, I suggest reading the Understanding Docker article. Docker isn’t so complex that any developer should be unable to understand how it works.

Dockerizing Java application

What files need to be included in a Java Application’s Docker image?

Since Docker containers are just isolated processes, your Java Docker image should only contain the files required to run your application.

What are these files?

It starts with a Java Runtime Environment (JRE). JRE is a software package, that has everything required to run a Java program. It includes an implementation of the Java Virtual Machine (JVM) with an implementation of the Java Class Library.

I recommend using OpenJDK JRE. OpenJDK is licensed under GPL with Classpath Exception. The Classpath Exception part is important. This license allows using OpenJDK with any software of any license, not just the GPL. In particular, you can use OpenJDK in proprietary software without disclosing your code.

Before using Oracle’s JDK/JRE, please read the following post: “Running Java on Docker? You’re Breaking the Law.”

Since it’s rare for Java applications to be developed using only the standard library, you most likely need to also add 3rd party Java libraries. Then add the application compiled bytecode as plain Java Class files or packaged into JAR archives. And, if you are using native code, you will need to add corresponding native libraries/packages too.

Choosing a base Docker image for Java Application

In order to choose the base Docker image, you need to answer the following questions:

  • What native packages do you need for your Java application?
  • Should you choose Ubuntu or Debian as your base image?
  • What is your strategy for patching security holes, including packages you are not using at all?
  • Do you mind paying extra (money and time) for network traffic and storage of unused files?

Some might say: “but, if all your images share the same Docker layers, you only download them just once, right?”

That’s true in theory, but the reality is often very different.

Usually, you have lots of different images: some you built recently, others a long time ago, and others you pulled from DockerHub. These images do not all share the same base image or version. You need to invest a lot of time to align these images to share the same base image and then keep them up-to-date.

Some might say: “but, who cares about image size? we download them just once and run forever”.

Docker image size is actually very important.

The size has an impact on …

  • network latency - need to transfer Docker image over the web
  • storage - need to store all these bits somewhere
  • service availability and elasticity - when using a Docker scheduler, like Kubernetes, Swarm, DC/OS or other (scheduler can move containers between hosts)
  • security - do you really, I mean really need the libpng package with all its CVE vulnerabilities for your Java application?
  • development agility - small Docker images == faster build time and faster deployment

Without being careful, Java Docker images tend to grow to enormous sizes. I’ve seen 3GB Java images where the real code and required JAR libraries only take around 150MB.

Consider using the Alpine Linux image, which is only about 5MB, as a base Docker image. Lots of “official Docker images” have an Alpine-based flavor.

Note: Many, but not all, Linux packages have versions compiled with the musl libc C runtime library. Sometimes you need a package that is compiled with glibc (the GNU C runtime library). The frolvlad/alpine-glibc image is based on the Alpine Linux image and contains glibc, which enables proprietary projects compiled against glibc (e.g. OracleJDK, Anaconda) to work on Alpine.

Choosing the right Java Application server

Frequently, you also need to expose some kind of interface to reach your Java application, that runs in a Docker container.

When you deploy Java applications with Docker containers, the default Java deployment model changes.

Originally, Java server-side deployment assumed that you had already pre-configured a Java web server (Tomcat, WebLogic, JBoss, or another) and that you deployed a WAR (Web Archive) packaged Java application to this server, running it together with other applications deployed on the same server.

Lots of tools are developed around this concept, allowing you to update running applications without stopping the Java Application server, route traffic to the new application, resolve possible class loading conflicts and more.

With Docker-based deployments, you do not need these tools anymore - you don’t even need the fat “enterprise-ready” Java application servers. The only thing that you need is a stable and scalable network server that can serve your API over HTTP/TCP or another protocol of your choice. Search Google for “embedded Java server” and pick the one you like most.

For this demo, I forked Spring Boot’s REST example and modified it a bit. The demo uses Spring Boot with an embedded Tomcat server. Here is my fork on GitHub repository (blog branch).

Building a Java Application Docker image

In order to run this demo, I need to create a Docker image with JRE, the compiled and packaged Java application, and all 3rd party libraries.

Here is the Dockerfile I used to build my Docker image. This demo Docker image is based on slim Alpine Linux with OpenJDK JRE and contains the application WAR file with all dependencies embedded into it. It’s just the bare minimum required to run the demo application.

# Base Alpine Linux based image with OpenJDK JRE only
FROM openjdk:8-jre-alpine

# copy application WAR (with libraries inside)
COPY target/spring-boot-*.war /app.war

# specify default command
CMD ["/usr/bin/java", "-jar", "-Dspring.profiles.active=test", "/app.war"]

To build the Docker image, run the following command:

$ docker build -t blog/sbdemo:latest .

Running the docker history command on the created Docker image will let you see all the layers that make up this image:

  • 4.8MB Alpine Linux Layer
  • 103MB OpenJDK JRE Layer
  • 61.8MB Application WAR file

    $ docker history blog/sbdemo:latest
    
    IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
    16d5236aa7c8        About an hour ago   /bin/sh -c #(nop)  CMD ["/usr/bin/java" "-...   0 B                 
    e1bbd125efc4        About an hour ago   /bin/sh -c #(nop) COPY file:1af38329f6f390...   61.8 MB             
    d85b17c6762e        2 months ago        /bin/sh -c set -x  && apk add --no-cache  ...   103 MB              
    <missing>           2 months ago        /bin/sh -c #(nop)  ENV JAVA_ALPINE_VERSION...   0 B                 
    <missing>           2 months ago        /bin/sh -c #(nop)  ENV JAVA_VERSION=8u111       0 B                 
    <missing>           2 months ago        /bin/sh -c #(nop)  ENV PATH=/usr/local/sbi...   0 B                 
    <missing>           2 months ago        /bin/sh -c #(nop)  ENV JAVA_HOME=/usr/lib/...   0 B                 
    <missing>           2 months ago        /bin/sh -c {   echo '#!/bin/sh';   echo 's...   87 B                
    <missing>           2 months ago        /bin/sh -c #(nop)  ENV LANG=C.UTF-8             0 B                 
    <missing>           2 months ago        /bin/sh -c #(nop) ADD file:eeed5f514a35d18...   4.8 MB              
    

Running the Java Application Docker container

In order to run the demo application, run the following command:

$ docker run -d --name demo-default -p 8090:8090 -p 8091:8091 blog/sbdemo:latest

Let’s check that the application is up and running (I’m using the httpie tool here):

$ http http://localhost:8091/info

HTTP/1.1 200 OK
Content-Type: application/json
Date: Thu, 09 Mar 2017 14:43:28 GMT
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked

{
    "build": {
        "artifact": "${project.artifactId}",
        "description": "boot-example default description",
        "name": "spring-boot-rest-example",
        "version": "0.1"
    }
}

Setting Docker container memory constraints

One thing you need to know about Java process memory allocation is that in reality it consumes more physical memory than specified with the -Xmx JVM option. The -Xmx option specifies only the maximum Java heap size. But the Java process is a regular Linux process, and what is interesting is how much actual physical memory this process is consuming.

Or in other words - what is the Resident Set Size (RSS) value for running a Java process?

Theoretically, in the case of a Java application, a required RSS size can be calculated by:

RSS = Heap size + MetaSpace + OffHeap size

where OffHeap consists of thread stacks, direct buffers, mapped files (libraries and jars) and JVM code itself.

There is a very good post on this topic: Analyzing java memory usage in a Docker container by Mikhail Krestjaninoff.

When using the --memory option with docker run, make sure the limit is larger (at least twice as large) than what you specify for -Xmx.
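
For example, here is a sketch for the demo image built above; the 1GB container limit, the 512MB heap, and the demo-mem container name are illustrative choices, not part of the original setup.

$ # cap the container at 1GB while limiting the Java heap to 512MB (at most half of the container limit)
$ docker run -d --name demo-mem --memory 1g -p 8090:8090 -p 8091:8091 blog/sbdemo:latest /usr/bin/java -jar -Xmx512m -Dspring.profiles.active=test /app.war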

Offtopic: Using OOM Killer instead of GC

There is an interesting JDK Enhancement Proposal (JEP) by Aleksey Shipilev: Epsilon GC (http://openjdk.java.net/jeps/8174901). This JEP proposes to develop a GC that only handles memory allocation but does not implement any actual memory reclamation mechanism.

This GC, combined with --restart (the Docker restart policy), should theoretically make it possible to support “extremely short-lived jobs” implemented in Java.

For ultra-performance-sensitive applications, where developers are conscious of memory allocations or want to create completely garbage-free applications, a GC cycle may be considered an implementation bug that wastes cycles for no good reason. In such a use case, it could be better to let the OOM (Out of Memory) Killer kill the process and use a Docker restart policy to restart the process.
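
A purely theoretical sketch of that setup with today's Docker flags (the memory limit and restart count are arbitrary numbers for illustration):

$ # run under a tight memory limit and let Docker restart the process after the kernel OOM-kills it
$ docker run -d --memory 256m --restart=on-failure:10 blog/sbdemo:latest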

Anyway, Epsilon GC is not available yet, so it’s just an interesting theoretical use case for the moment.

Building Java applications with Builder container

As you probably noticed, in the previous step I did not explain how I created the application WAR file.

Of course, there is a Maven project file, pom.xml, which most Java developers should be familiar with. But in order to actually build, you need to install the same Java build tools (JDK and Maven) on every machine where you build the application. You need to have the same versions, use the same repositories, and share the same configurations. While it’s possible, managing different projects that rely on different tools, versions, configurations, and development environments can quickly become a nightmare.

What if you also want to run a build on a clean machine that does not have Java or Maven installed? What should you do?

Java Builder Container

Docker can help here too. With Docker, you can create and share portable development and build environments. The idea is to create a special Builder Docker image, that contains all tools you need to properly build your Java application, e.g.: JDK, Ant, Maven, Gradle, SBT or others.

To create a really useful Builder Docker image, you need to know well how your Java build tools work and how docker build invalidates the build cache. Without proper design, you will end up with ineffective and slow builds.

Running Maven in Docker

While most of these tools were created nearly a generation ago, they are still very popular and widely used by Java developers.

Java development life is hard to imagine without some extra build tools. There are multiple Java build tools out there, but most of them share similar concepts and serve the same goals - resolving cumbersome package dependencies and running different build tasks, such as compile, lint, test, package, and deploy.

In this post, I will use Maven, but the same approach can be applied to Gradle, SBT, and other less popular Java Build tools.

It’s important to learn how your Java build tool works and how it can be tuned. Apply this knowledge when creating a Builder Docker image and deciding how to run a Builder Docker container.

Maven uses the project-level pom.xml file to resolve project dependencies. It downloads missing JAR files from private and public Maven repositories, and caches these files for future builds. Thus, the next time you run your build, it won’t download anything if your dependencies have not changed.

Official Maven Docker image: should you use it?

The Maven team provides official Docker images. There are multiple images (under different tags) that allow you to select the image that answers your needs. Take a deeper look at the Dockerfile files and mvn-entrypoint.sh shell scripts when selecting a Maven image to use.

There are two flavors of official Maven Docker images: regular images (JDK version, Maven version, and Linux distro) and onbuild images.

What is the official Maven image good for?

The official Maven image does a good job of containerizing the Maven tool itself. The image contains some JDK and Maven version. Using such an image, you can run a Maven build on any machine without installing a JDK and Maven.

Example: running mvn clean install on local folder

$ docker run -it --rm --name my-maven-project -v "$PWD":/usr/src/app -w /usr/src/app maven:3.2-jdk-7 mvn clean install

The Maven local repository, for official Maven images, is placed inside a Docker data volume. That means all downloaded dependencies are not part of the image and will disappear once the Maven container is destroyed. If you do not want to download dependencies on every build, mount the Maven repository Docker volume to some persistent storage (at least a local folder on the Docker host).

Example: running mvn clean install on local folder with properly mounted Maven local repository

$ docker run -it --rm --name my-maven-project -v "$PWD":/usr/src/app -v "$HOME"/.m2:/root/.m2 -w /usr/src/app maven:3.2-jdk-7 mvn clean install

Now, let’s take a look at onbuild Maven Docker images.

What is Maven onbuild image?

The Maven onbuild Docker image exists to “simplify” a developer’s life, allowing them to skip writing a Dockerfile. Actually, a developer still has to write a Dockerfile, but it’s usually enough to have a single line in it:

FROM maven:<versions>-onbuild

Looking into onbuild Dockerfile on the GitHub repository …

FROM maven:<version>

RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

ONBUILD ADD . /usr/src/app
ONBUILD RUN mvn install

… you can see several Dockerfile commands with the ONBUILD prefix. The ONBUILD tells Docker to postpone the execution of these build commands until building a new image that inherits from the current image.

In our example, two build commands will be executed when you build an application Dockerfile created FROM maven:<version>-onbuild:

  • Add current folder (all files, if you are not using .dockerignore) to the new Docker image
  • Run mvn install Maven target

The onbuild Maven Docker image is not as useful as the previous image.

First of all, it copies everything from the current repository, so do not use it without a properly configured .dockerignore file.

Then, think: what kind of image are you trying to build?

The new image, created from onbuild Maven Docker image, includes JDK, Maven, application code (and potentially all files from current directory), and all files produced by Maven install phase (compiled, tested and packaged app; plus lots of build junk files you do not really need).

So, this Docker image contains everything, but, for some strange reason, does not contain a local Maven repository. I have no idea why the Maven team created this image.

Recommendation: Do not use Maven onbuild images!

If you just want to use Maven tool, use non-onbuild image.

If you want to create proper Builder image, I will show you how to do this later in this post.

Where to keep Maven cache?

The official Maven Docker image chooses to keep the Maven cache folder outside of the container, exposing it as a Docker data volume using the VOLUME /root/.m2 command in the Dockerfile. A Docker data volume is a directory within one or more containers that bypasses the Docker Union File System; in simple words, it’s not part of the Docker image.

What you should know about Docker data volumes:

  • Volumes are initialized when a container is created.
  • Data volumes can be shared and reused among containers.
  • Changes to a data volume are made directly to the mounted endpoint (usually some directory on host, but can be some storage device too)
  • Changes to a data volume will not be included when you update an image or persist Docker container.
  • Data volumes persist even if the container itself is deleted.

So, in order to reuse Maven cache between different builds, mount a Maven cache data volume to some persistent storage (for example, a local directory on the Docker host).

$ docker run -it --rm --volume "$PWD"/pom.xml:/usr/src/app/pom.xml --volume "$HOME"/.m2:/root/.m2 maven:3-jdk-8-alpine mvn install

The command above runs the official Maven Docker image (Maven 3 and OpenJDK 8), mounts the project pom.xml file into the working directory, and mounts the host's $HOME/.m2 folder as the Maven cache data volume.

Maven running inside this Docker container will download all required JAR files into the host’s local folder $HOME/.m2. The next time you create a new Maven Docker container for the same pom.xml file and the same cache mount, Maven will reuse the cache and download only missing or updated JAR files.

Maven Builder Docker image

First, let’s try to formulate what a Builder Docker image is and what it should contain.

A Builder is a Docker image that contains everything needed to create a reproducible build on any machine and at any point in time.

So, what should it contain?

  • Linux shell and some tools - I prefer Alpine Linux
  • JDK (version) - for the javac compiler
  • Maven (version) - Java build tool
  • Application source code and pom.xml file/s - it’s the application code SNAPSHOT at a specific point in time; just code, no need to include a .git repository or other files
  • Project dependencies (Maven local repository) - all POM and JAR files you need to build and test the Java application, at any time, even offline, even if a library disappears from the web

The Builder image captures code, dependencies, and tools at a specific point in time and stores them inside a Docker image. The Builder container can be used to create the application “binaries” on any machine, at any time, and even without an internet connection (or with a poor one).

Here is the sample Dockerfile for my demo Builder:

FROM openjdk:8-jdk-alpine

# ----
# Install Maven
RUN apk add --no-cache curl tar bash

ARG MAVEN_VERSION=3.3.9
ARG USER_HOME_DIR="/root"

RUN mkdir -p /usr/share/maven && \
  curl -fsSL http://apache.osuosl.org/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz | tar -xzC /usr/share/maven --strip-components=1 && \
  ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_CONFIG "$USER_HOME_DIR/.m2"
# speed up Maven JVM a bit
ENV MAVEN_OPTS="-XX:+TieredCompilation -XX:TieredStopAtLevel=1"

ENTRYPOINT ["/usr/bin/mvn"]

# ----
# Install project dependencies and keep sources 

# make source folder
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

# install maven dependency packages (keep in image)
COPY pom.xml /usr/src/app
RUN mvn -T 1C install && rm -rf target

# copy other source files (keep in image)
COPY src /usr/src/app/src

Let’s go over this Dockerfile and I will try to explain the reasoning behind each command.

  • FROM: openjdk:8-jdk-alpine - select and freeze JDK version: OpenJDK 8 and Linux Alpine
  • Install Maven
    • ARG ... - Use build arguments to allow overriding Maven version and local repository location (MAVEN_VERSION and USER_HOME_DIR) with docker build --build-arg ...
    • RUN mkdir -p ... curl ... tar ... - Download and install (untar and ln -s) Apache Maven
    • Speed up Maven JVM a bit: MAVEN_OPTS="-XX:+TieredCompilation -XX:TieredStopAtLevel=1", read the following post
  • RUN mvn -T 1C install && rm -rf target Download project dependencies:
    • Copy the project pom.xml file, run the mvn install command, and remove the build artifacts (as far as I know, there is no Maven command that will let you download dependencies without installing)
    • This Docker image layer will be rebuilt only when project’s pom.xml file changes
  • COPY src /usr/src/app/src - copy project source files (source, tests, and resources)

Note: if you are using the Maven Surefire plugin and want to have all dependencies available for the offline build, make sure to lock down the Surefire test provider.

When you build a new Builder version, I suggest you use the --cache-from option, passing the previous Builder image to it. This will allow you to reuse any unmodified Docker layers and avoid unnecessary downloads most of the time (if pom.xml did not change and you did not decide to upgrade Maven or the JDK).

$ # pull latest (or specific version) builder image
$ docker pull myrep/mvn-builder:latest
$ # build new builder
$ docker build -t myrep/mvn-builder:latest --cache-from myrep/mvn-builder:latest .

Use Builder container to run tests

$ # run tests - test results are saved into $PWD/target/surefire-reports
$ docker run -it --rm -v "$PWD"/target:/usr/src/app/target myrep/mvn-builder -T 1C -o test

Use Builder container to create application WAR

$ # create application WAR file (skip tests) - $PWD/target/spring-boot-rest-example-0.3.0.war
$ docker run -it --rm -v "$PWD"/target:/usr/src/app/target myrep/mvn-builder package -T 1C -o -Dmaven.test.skip=true

Summary

Take a look at the images below:

REPOSITORY      TAG     IMAGE ID     CREATED        SIZE
sbdemo/run      latest  6f432638aa60 7 minutes ago  143 MB
sbdemo/tutorial 1       669333d13d71 12 minutes ago 1.28 GB
sbdemo/tutorial 2       38634e4d9d5e 3 hours ago    1.26 GB
sbdemo/builder  mvn     2d325a403c5f 5 days ago     263 MB

  • sbdemo/run:latest - Docker image for demo runtime: Alpine, OpenJDK JRE only, demo WAR
  • sbdemo/builder:mvn - Builder Docker image: Alpine, OpenJDK 8, Maven 3, code, dependency
  • sbdemo/tutorial:1 - Docker image created following first tutorial (just for reference)
  • sbdemo/tutorial:2 - Docker image created following second tutorial (just for reference)

Bonus: Build flow automation

In this section, I will show how to use Docker build flow automation service to automate and orchestrate all steps from this post.

Build Pipeline Steps

I’m going to use Codefresh.io Docker CI/CD service (the company I’m working for) to create a Builder Docker image for Maven, run tests, create application WAR, build Docker image for application and deploy it to DockerHub.

The Codefresh automation flow YAML (also called a pipeline) is pretty straightforward:

  • it contains ordered list of steps
  • each step can be of type:
    • build - for docker build command
    • push - for docker push
    • composition - for creating environment, specified with docker-compose
    • freestyle (default if not specified) - for docker run command
  • /codefresh/volume/ data volume (git clone and files generated by steps) is mounted into each step
  • current working directory for each step is set to /codefresh/volume/ by default (can be changed)

For detailed description and other examples, take a look at the documentation.

For my demo flow I’ve created the following automation steps:

  1. mvn_builder - create Maven Builder Docker image
  2. mvn_test - execute tests in the Builder container, place test results into the /codefresh/volume/target/surefire-reports/ data volume folder
  3. mvn_package - create the application WAR file, place the created file into the /codefresh/volume/target/ data volume folder
  4. build_image - build application Docker image with JRE and application WAR file
  5. push_image - tag and push the application Docker image to DockerHub

Here is the full Codefresh YAML:

version: '1.0'

steps:

  mvn_builder:
    type: build
    description: create Maven builder image
    dockerfile: Dockerfile.build
    image_name: <put_your_repo_here>/mvn-builder

  mvn_test:
    description: run unit tests 
    image: ${{mvn_builder}}
    commands:
      - mvn -T 1C -o test
  
  mvn_package:
    description: package application and dependencies into WAR 
    image: ${{mvn_builder}}
    commands:
      - mvn package -T 1C -o -Dmaven.test.skip=true

  build_image:
    type: build
    description: create Docker image with application WAR
    dockerfile: Dockerfile
    working_directory: ${{main_clone}}/target
    image_name: <put_your_repo_here>/sbdemo

  push_image:
    type: push
    description: push application image to DockerHub
    candidate: '${{build_image}}'
    tag: '${{CF_BRANCH}}'
    credentials:
      # set docker registry credentials in project configuration
      username: '${{DOCKER_USER}}'
      password: '${{DOCKER_PASS}}'

Hope, you find this post useful. I look forward to your comments and any questions you have.


_This is a working draft version. The final post version is published at Codefresh Blog on March 22, 2017._

02 Jan 2017, 18:00

Everyday hacks for Docker

In this post, I’ve decided to share some useful commands and tools that I frequently use when working with the amazing Docker technology. There is no particular order or “coolness level” to these “hacks”. I will try to present the use case and how a specific command or tool helps me with my work.

Docker Hacks

Cleaning up

Working with Docker for some time, you start to accumulate development junk: unused volumes, networks, exited containers and unused images.

One command to “rule them all”

$ docker system prune

prune is a very useful command (it also works for the volume and network sub-commands), but it’s only available since Docker 1.13. So, if you are using an older Docker version, the following commands can help you replace the prune command.

Remove dangling volumes

Dangling volumes are volumes not in use by any container. To remove them, combine two commands: first, list the volume IDs of the dangling volumes, then remove them.

$ docker volume rm $(docker volume ls -q -f "dangling=true")

Remove exited containers

The same principle works here too: first, list the containers (only IDs) you want to remove (with a filter), then remove them (consider rm -f to force removal).

$ docker rm $(docker ps -q -f "status=exited")

Remove dangling images

Dangling images are untagged Docker images that are the leaves of the image tree (not intermediary layers).

$ docker rmi $(docker images -q -f "dangling=true")

Autoremove interactive containers

When you run a new interactive container and want to avoid typing the rm command after it exits, use the --rm option. Then, when you exit the created container, it will be automatically destroyed.

$ docker run -it --rm alpine sh

Inspect Docker resources

jq - jq is a lightweight and flexible command-line JSON processor. It is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

docker info and docker inspect commands can produce output in JSON format. Combine these commands with jq processor.

Pretty JSON and jq processing

# show whole Docker info
$ docker info --format "{{json .}}" | jq .

# show Plugins only
$ docker info --format "{{json .Plugins}}" | jq .

# list IP addresses for all containers connected to 'bridge' network
$ docker network inspect bridge -f '{{json .Containers}}' | jq '.[] | {cont: .Name, ip: .IPv4Address}'

Watching containers lifecycle

Sometimes you want to see containers being started and exited as you run docker commands or try different restart policies. The watch command combined with docker ps can be pretty useful here. The docker stats command, even with the --format option, is not useful for this use case, since it does not show the same information as docker ps.

Display a table with ‘ID Image Status’ for active containers and refresh it every 2 seconds

$ watch -n 2 'docker ps --format "table {{.ID}}\t {{.Image}}\t {{.Status}}"'

Enter into host/container Namespace

Sometimes you want to connect to the Docker host. The ssh command is the default option, but it may be unavailable due to security settings or firewall rules, or simply undocumented (try to find out how to ssh into the Docker for Mac VM).

nsenter, by Jérôme Petazzoni, is a small and very useful tool for the above cases. nsenter allows you to enter namespaces. I like to use the minimalistic (580 kB) walkerlee/nsenter Docker image.

Enter into Docker host

Use --pid=host to enter into Docker host namespaces

# get a shell into Docker host
$ docker run --rm -it --privileged --pid=host walkerlee/nsenter -t 1 -m -u -i -n sh

Enter into ANY container

It’s also possible to enter any container with nsenter and --pid=container:[id OR name]. But in most cases, it’s better to use the standard docker exec command. The main difference is that nsenter doesn’t enter the cgroups, and therefore evades resource limitations (which can be useful for debugging).

# get a shell into 'redis' container namespace
$ docker run --rm -it --privileged --pid=container:redis walkerlee/nsenter -t 1 -m -u -i -n sh

Heredoc Docker container

Suppose you want to get some tool as a Docker image, but you do not want to search for a suitable image or create a new Dockerfile (there is no need to keep it for future use, for example). Sometimes storing a Docker image definition in a file looks like overkill - you need to decide how to edit, store and share this Dockerfile. Sometimes it’s better to have a single-line command that you can copy, share, embed into a shell script or turn into a command alias. So, when you want to create a new ad-hoc container with a single command, try the Heredoc approach.

Build an Alpine-based image with the ‘htop’ tool

$ docker build -t htop - << EOF
FROM alpine
RUN apk --no-cache add htop
EOF
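To try out the freshly built image, one option is to share the host PID namespace so htop can see all host processes (drop --pid=host to see only the container’s own processes):

# run htop from the image built above
$ docker run -it --rm --pid=host htop htop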

Docker command completion

Docker CLI syntax is very rich and constantly growing, adding new commands and new options. It’s hard to remember every possible command and option, so having nice command completion in your terminal is a must-have.

Command completion is a kind of terminal plugin that lets you auto-complete or auto-suggest what to type next by hitting the Tab key. Docker command completion works both for commands and options. The Docker team prepared command completion for the docker, docker-machine and docker-compose commands, both for Bash and Zsh.

If you are using a Mac and Homebrew, then installing Docker command completion is pretty straightforward.

# Tap homebrew/completions to gain access to these formulas
$ brew tap homebrew/completions

# Install completions for docker suite
$ brew install docker-completion
$ brew install docker-compose-completion
$ brew install docker-machine-completion

For non-Mac installs, read the official Docker documentation for docker engine, docker-compose and docker-machine.
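On Linux, one common approach (a sketch only; the script’s location in the docker/cli repository may change, so check the documentation linked above) is to drop the Bash completion script into /etc/bash_completion.d:

# install Docker CLI Bash completion system-wide (assumes the bash-completion package is installed)
$ sudo curl -L https://raw.githubusercontent.com/docker/cli/master/contrib/completion/bash/docker \
    -o /etc/bash_completion.d/docker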

Start containers automatically

When you run a process in a Docker container, it may fail for many reasons. Sometimes rerunning the failed container is enough to fix the problem. If you use a Docker orchestration engine, like Swarm or Kubernetes, the failed service will be restarted automatically. But if you use plain Docker and want to restart a container based on the exit code of its main process, or always (regardless of the exit code), Docker 1.12 introduced a very helpful option for the docker run command: --restart.

Restart always

Restart the redis container with a restart policy of always so that if the container exits, Docker will restart it.

$ docker run --restart=always redis

Restart container on failure

Restart the redis container with a restart policy of on-failure and a maximum restart count of 10.

$ docker run --restart=on-failure:10 redis

Network tricks

There are cases when you want to create a new container and connect it to an already existing network stack. It can be the Docker host network or another container’s network. This can be pretty useful for debugging and auditing network issues. The docker run --network (--net) option supports this use case.

Use Docker host network stack

$ docker run --net=host ...

The new container will attach to the same network interfaces as the Docker host.

Use another container’s network stack

$ docker run --net=container:<name|id> ...

The new container will attach to the same network interfaces as the other container. The target container can be specified by id or name.

Attachable overlay network

Using the Docker engine in swarm mode, you can create a multi-host overlay network on a manager node. When you create a new swarm service, you can attach it to the previously created overlay network.

Sometimes, to inspect network configuration or debug network issues, you want to attach a new Docker container filled with network tools to an existing overlay network, and do this with the docker run command rather than by creating a new “debug service”.

Docker 1.13 brings a new option to the docker network create command: --attachable. The attachable option enables manual container attachment.

# create an attachable overlay network
$ docker network create --driver overlay --attachable mynet
# create net-tools container and attach it to mynet overlay network
$ docker run -it --rm --net=mynet net-tools sh

18 Dec 2016, 14:00

Deploy Docker Compose (v3) to Swarm (mode) Cluster

Disclaimer: all code snippets below work only with Docker 1.13+

TL;DR

Docker 1.13 simplifies deployment of a composed application to a swarm (mode) cluster. You can do it without creating a new dab (Distributed Application Bundle) file, just by using the familiar and well-known docker-compose.yml syntax (with some additions) and the --compose-file option.

Compose to Swarm

Swarm cluster

Docker Engine 1.12 introduced a new swarm mode for natively managing a cluster of Docker Engines, called a swarm. Docker swarm mode implements the Raft consensus algorithm and no longer requires an external key-value store, such as Consul or etcd.

If you want to run a swarm cluster on a developer’s machine, there are several options.

The first and most widely known option is to use the docker-machine tool with a virtualization driver (VirtualBox, Parallels or other).
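For example, a minimal sketch with the VirtualBox driver (machine names are illustrative):

# create two VirtualBox-backed Docker hosts
$ docker-machine create -d virtualbox manager
$ docker-machine create -d virtualbox worker-1

# initialize the swarm on the manager node, advertising its VirtualBox IP
$ docker-machine ssh manager docker swarm init --advertise-addr $(docker-machine ip manager)

The worker machine would then join the swarm with docker swarm join and the token printed by the init command.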

But in this post I will use another approach: the docker-in-docker Docker image with Docker for Mac; see more details in my Docker Swarm cluster with docker-in-docker on MacOS post.

Docker Registry mirror

When you deploy a new service on a local swarm cluster, I recommend setting up a local Docker registry mirror and running all swarm nodes with the --registry-mirror option, pointing to the local registry mirror. This keeps most of the redundant image fetch traffic on your local network and speeds up service deployment.
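A minimal sketch of such a mirror, using the official registry:2 image as a pull-through cache (the container name and port are illustrative):

# run a local pull-through cache for Docker Hub
$ docker run -d --restart=always --name registry-mirror -p 5000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io registry:2

# each node's Docker daemon is then started with:
#   dockerd --registry-mirror=http://<mirror-host>:5000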

Docker Swarm cluster bootstrap script

I’ve prepared a shell script to bootstrap a 4-node swarm cluster with a Docker registry mirror and a very nice swarm visualizer application.

The script initializes the local Docker engine as a swarm manager, then starts 3 new docker-in-docker containers and joins them to the swarm cluster as worker nodes. All worker nodes run with the --registry-mirror option.
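The full script lives in the post referenced above; a minimal sketch of the same idea (the docker:dind image tag, mirror port and sleep are illustrative assumptions) looks like this:

# 1. make the local engine a swarm manager
docker swarm init

# 2. capture the worker join token and the manager address
TOKEN=$(docker swarm join-token -q worker)
MANAGER_IP=$(docker info --format '{{.Swarm.NodeAddr}}')

# 3. start 3 docker-in-docker workers, each pointed at the registry mirror,
#    and join them to the swarm
for i in 1 2 3; do
  docker run -d --privileged --name worker-${i} --hostname worker-${i} \
    docker:dind --registry-mirror=http://${MANAGER_IP}:5000
  sleep 5  # give the inner Docker daemon a moment to start
  docker exec worker-${i} docker swarm join --token ${TOKEN} ${MANAGER_IP}:2377
done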

Deploy multi-container application - the “old” way

Docker Compose is a tool (and deployment specification format) for defining and running composed multi-container Docker applications. Before Docker 1.12, you could use the docker-compose tool to deploy such applications to a swarm cluster. With the 1.12 release, this is no longer possible: docker-compose can deploy your application only to a single Docker host.

In order to deploy it to a swarm cluster, you need to create a special deployment specification file (also known as a Distributed Application Bundle) in dab format (see more here).

The way to create this file is to run the docker-compose bundle command. The output of this command is a JSON file that describes the multi-container composed application, with Docker images referenced by @sha256 digest instead of tags. Currently, the dab file format does not support many of the settings from docker-compose.yml and does not allow you to use the options supported by the docker service create command.
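For example (assuming the images are already pushed to a registry, so their digests can be resolved):

# generate a dab file from the docker-compose.yml in the current directory
$ docker-compose bundle -o myapp.dab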

It’s a pity: the dab bundle format looks promising, but is currently practically useless (at least in Docker 1.12).

Deploy multi-container application - the “new” way

With Docker 1.13, the “new” way to deploy a multi-container composed application is to use docker-compose.yml again (hurrah!). Kudos to the Docker team!

Note: you do not need the docker-compose tool, only a yaml file in docker-compose format (version: "3").

$ docker deploy --compose-file docker-compose.yml myapp

Docker compose v3 (version: "3")

So, what’s new in docker compose version 3?

First, I suggest you take a deeper look at the docker-compose schema. It is an extension of the well-known docker-compose format.

Note: the docker-compose tool (ver. 1.9.0) does not support docker-compose.yml version: "3" yet.

The most visible change is around swarm service deployment. Now you can specify the options supported by the docker service create/update commands:

  • number of service replicas (or global service)
  • service labels
  • hard and soft limits for service (container) CPU and memory
  • service restart policy
  • service rolling update policy
  • deployment placement constraints

Docker compose v3 example

I’ve created a “new” compose file (v3) for the classic “Cats vs. Dogs” example. This example application contains 5 services with the following deployment configurations (a sketch of the worker service’s deploy section follows the list):

  1. voting-app - a Python webapp which lets you vote between two options; requires redis
  2. redis - Redis queue which collects new votes; deployed on swarm manager node
  3. worker - a .NET worker which consumes votes and stores them in db;
    • number of replicas: 2
    • hard limit: max 25% CPU and 512MB memory
    • soft limit: max 25% CPU and 256MB memory
    • placement: on swarm worker nodes only
    • restart policy: restart on-failure, with 5 seconds delay, up to 3 attempts
    • update policy: one by one, with 10 seconds delay and 0.3 failure rate to tolerate during the update
  4. db - Postgres database backed by a Docker volume; deployed on swarm manager node
  5. result-app - a Node.js webapp which shows the results of the voting in real time; 2 replicas, deployed on swarm worker nodes
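A sketch of the worker service’s deploy section (the image name and exact values are illustrative; see the referenced compose file for the complete version):

version: "3"
services:
  worker:
    image: myorg/voting-worker    # illustrative image name
    deploy:
      replicas: 2
      resources:
        limits:          # hard limit
          cpus: "0.25"
          memory: 512M
        reservations:    # soft limit
          cpus: "0.25"
          memory: 256M
      placement:
        constraints: [node.role == worker]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      update_config:
        parallelism: 1
        delay: 10s
        max_failure_ratio: 0.3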

Run the docker deploy --compose-file docker-compose.yml command to deploy my version of the “Cats vs. Dogs” application on a swarm cluster.


Hope you find this post useful. I look forward to your comments and any questions you have.