
Docker & Linux Containers

Araf Karsh Hamid


Topics 2

1. Docker
   1. Docker Container
   2. Docker Key Concepts
   3. Docker Internals
   4. Docker Architecture Linux Vs. OS X
   5. Docker Architecture Windows
   6. Docker Architecture Linux (Docker Daemon and Client)
   7. Anatomy of Dockerfile
   8. Building a Docker Image
   9. Creating and Running a Docker Container
   10. Invoking Docker Container using Java ProcessBuilder
2. Linux Containers
Docker containers are Linux Containers 3

DOCKER CONTAINER = CGROUPS + NAMESPACES + IMAGES

CGROUPS
• Kernel feature
• Groups of processes
• Control resource allocation
• CPU, CPU sets
• Memory
• Disk
• Block I/O

NAMESPACES
• The real magic behind containers
• It creates barriers between processes
• Different namespaces:
• PID namespace
• Net namespace
• IPC namespace
• MNT namespace
• Linux kernel namespaces introduced between kernel 2.6.15 and 2.6.26

IMAGES
• Not a file system
• Not a VHD
• Basically a tar file
• Has a hierarchy
• Arbitrary depth
• Fits into the Docker Registry

docker run ≈ lxc-start

Docker Key Concepts 4
• Docker images
• A Docker image is a read-only template.
• For example, an image could contain an Ubuntu operating system with Apache and your
web application installed.
• Images are used to create Docker containers.
• Docker provides a simple way to build new images or update existing images, or you can
download Docker images that other people have already created.
• Docker images are the build component of Docker.
• Docker containers
• Docker containers are similar to a directory.
• A Docker container holds everything that is needed for an application to run.
• Each container is created from a Docker image.
• Docker containers can be run, started, stopped, moved, and deleted.
• Each container is an isolated and secure application platform.
• Docker containers are the run component of Docker.

• Docker Registries
• Docker registries hold images.
• These are public or private stores from which you upload or download images.
• The public Docker registry is called Docker Hub.
• It provides a huge collection of existing images for your use.
• These can be images you create yourself or you can use images that others have
previously created.
• Docker registries are the distribution component of Docker.
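The three components map directly onto everyday client commands. A minimal session tying them together might look like this (a sketch using image names from the slides; it assumes a running Docker daemon and, for push, a registry login):

```shell
# Image (build component): pull a read-only template from Docker Hub
docker pull ubuntu

# Container (run component): create and run a container from the image
docker run -it ubuntu /bin/bash

# Registry (distribution component): push an image you have built
docker push applifire/ubuntu
```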
Docker Architecture Linux Vs. OS X 5

• In an OS X installation, the Docker daemon runs inside a Linux virtual machine provided by Boot2Docker.

• In OS X, the Docker host address is the address of the Linux VM. When you start the boot2docker process, the VM is assigned an IP address. Under boot2docker, ports on a container map to ports on the VM.

Docker – Somewhere in the Future … 6

Docker running natively on Windows!

Docker Architecture – Linux 7

• Docker Daemon
• The Docker daemon does the heavy lifting of building, running, and distributing your Docker containers.
• Both the Docker client and the daemon can run on the same system, or you can connect a Docker client to a remote Docker daemon.
• The Docker client and daemon communicate via sockets or through a RESTful API.

• Docker Client (docker) Commands
• search (search for images in the Docker repository)
• pull (pull an image)
• run (run a container)
• create (create a container)
• build (build an image using a Dockerfile)
• images (show images)
• push (push an image to the Docker repository)
• import / export
• start (start a stopped container)
• stop (stop a container)
• restart (restart a container)
• save (save an image to a tar archive)
• exec (run a command in a running container)
• top (look at the running processes in a container)
• ps (list the containers)
• attach (attach to a running container)
• diff (inspect changes to a container's file system)

Examples

$ docker search applifire
$ docker pull applifire/jdk:7
$ docker images
$ docker run -it applifire/jdk:7 /bin/bash
Docker client examples 8

To pull an image: docker pull applifire/tomcat

[Screenshot: searching the Docker registry for images.]

[Screenshot: images in your local registry after a build, or pulled directly from the Docker registry.]
Analyzing the “docker run -it ubuntu /bin/bash” command 9

In order, Docker does the following:

1. Pulls the ubuntu image:
• Docker checks for the presence of the ubuntu image and, if it doesn't exist locally on the host, downloads it from Docker Hub.
• If the image already exists, Docker uses it for the new container.
2. Creates a new container:
• Once Docker has the image, it uses it to create a container.
3. Allocates a filesystem and mounts a read-write layer:
• The container is created in the file system and a read-write layer is added to the image.
4. Allocates a network / bridge interface:
• Creates a network interface that allows the Docker container to talk to the local host.
5. Sets up an IP address:
• Finds and attaches an available IP address from a pool.
6. Executes the process that you specify:
• Runs your application.
7. Captures and provides application output:
• Connects and logs standard input, output, and errors for you to see how your application is running.
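A session following these steps might look like the sketch below (output depends on your host, and a running Docker daemon is required):

```shell
# Run an interactive shell in an ubuntu container; Docker pulls the
# image from Docker Hub if it is not present locally.
docker run -it ubuntu /bin/bash

# Inside the container, the shell sees its own hostname and a virtual
# network interface with an IP address from Docker's pool:
hostname            # prints the container ID
ip addr show eth0   # the IP Docker attached in step 5
exit                # stopping the process stops the container
```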
Anatomy of a Dockerfile 10

FROM
The FROM instruction sets the Base Image for subsequent instructions. As such, a valid Dockerfile must have FROM as its first instruction. The image can be any valid image – it is especially easy to start by pulling an image from the public repositories.
Example: FROM ubuntu / FROM applifire/jdk:7

MAINTAINER
The MAINTAINER instruction allows you to set the Author field of the generated images.
Example: MAINTAINER arafkarsh

LABEL
The LABEL instruction adds metadata to an image. A LABEL is a key-value pair. To include spaces within a LABEL value, use quotes and backslashes as you would in command-line parsing.
Example: LABEL version="1.0" / LABEL vendor="Algo"

RUN
The RUN instruction will execute any commands in a new layer on top of the current image and commit the results. The resulting committed image will be used for the next step in the Dockerfile.
Example: RUN apt-get install -y curl

ADD
The ADD instruction copies new files, directories or remote file URLs from <src> and adds them to the filesystem of the container at the path <dest>.
Example: ADD hom* /mydir/ / ADD hom?.txt /mydir/

COPY
The COPY instruction copies new files or directories from <src> and adds them to the filesystem of the container at the path <dest>.
Example: COPY hom* /mydir/ / COPY hom?.txt /mydir/

ENV
The ENV instruction sets the environment variable <key> to the value <value>. This value will be in the environment of all "descendant" Dockerfile commands and can be replaced inline in many as well.
Example: ENV JAVA_HOME /JDK8 / ENV JRE_HOME /JRE8

EXPOSE
The EXPOSE instruction informs Docker that the container will listen on the specified network ports at runtime. Docker uses this information to interconnect containers using links and to determine which ports to expose to the host when using the -P flag with the docker client.
Example: EXPOSE 8080
Anatomy of a Dockerfile 11

VOLUME
The VOLUME instruction creates a mount point with the specified name and marks it as holding externally mounted volumes from the native host or other containers. The value can be a JSON array, VOLUME ["/var/log/"], or a plain string with multiple arguments, such as VOLUME /var/log or VOLUME /var/log /var/db.
Example: VOLUME /data/webapps

USER
The USER instruction sets the user name or UID to use when running the image and for any RUN, CMD and ENTRYPOINT instructions that follow it in the Dockerfile.
Example: USER applifire

WORKDIR
The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile.
Example: WORKDIR /home/user

CMD
There can only be one CMD instruction in a Dockerfile. If you list more than one CMD, then only the last CMD will take effect. The main purpose of a CMD is to provide defaults for an executing container. These defaults can include an executable, or they can omit the executable, in which case you must specify an ENTRYPOINT instruction as well.
Example: CMD echo "This is a test." | wc -

ENTRYPOINT
An ENTRYPOINT allows you to configure a container that will run as an executable. Command line arguments to docker run <image> will be appended after all elements in an exec form ENTRYPOINT, and will override all elements specified using CMD. This allows arguments to be passed to the entry point, i.e., docker run <image> -d will pass the -d argument to the entry point. You can override the ENTRYPOINT instruction using the docker run --entrypoint flag.
Example: ENTRYPOINT ["top", "-b"]
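Putting several of these instructions together, a minimal Dockerfile for a Tomcat-on-JRE image like the one built later might look like this (a sketch, not the author's actual file; the base image name, tarball, and paths are assumptions):

```dockerfile
# Base image: the JRE 8 image built on an earlier slide (name assumed)
FROM applifire/jre:8
MAINTAINER arafkarsh
LABEL version="1.0"

# Install Tomcat under /tomcat (illustrative path; the tarball must
# exist next to the Dockerfile - ADD auto-extracts local archives)
ENV CATALINA_HOME /tomcat
ADD apache-tomcat-8.tar.gz /tomcat/

EXPOSE 8080
WORKDIR /tomcat
CMD ["/tomcat/bin/catalina.sh", "run"]
```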
Building a Docker image : Base Ubuntu 12

1. Dockerfile (Text File)
• Create the Dockerfile.

2. Build the image using the Dockerfile
• The following command will build the Docker image based on the Dockerfile.
• Docker will automatically download any images required for the build from the Docker registry.

docker build -t applifire/ubuntu .

• This builds a base Ubuntu image with enough Linux utilities for the development environment.
Building a Docker Image : Java 8 (JRE) + Tomcat 8 13

1. Dockerfile (Text File)
   1. Create the Java (JRE8) Dockerfile with Ubuntu as the base image.
   2. Create the Tomcat Dockerfile with JRE8 as the base image.

2. Build the images using the Dockerfiles
   1. Build the Java 8 (JRE) Docker image:
      docker build -t applifire/jre:8 .
   2. Build the Tomcat 8 Docker image:
      docker build -t applifire/tomcat:jre8 .
Building a Docker Image : Java 7 (JDK) + Gradle 2.3 14

1. Dockerfile (Text File)
   1. Create the Java (JDK7) Dockerfile with Ubuntu as the base image.
   2. Create the Gradle Dockerfile with Java (JDK7) as the base image.

2. Build the images using the Dockerfiles
   1. Build the Java 7 (JDK) Docker image:
      docker build -t applifire/jdk:7 .
   2. Build the Gradle 2.3 Docker image:
      docker build -t applifire/gradle:jdk7 .
Creating & Running Docker Containers 15

docker run options:

-d      Detached mode. To run servers like Tomcat or Apache Web Server.

-p      Publish the container's port:
        IP:hostPort:containerPort   192.a.b.c:80:8080
        IP::containerPort           192.a.b.c::8080
        hostPort:containerPort      8081:8080

-it     Run in interactive mode. Use when you want to log into the container. This mode works fine from a Unix shell; however, ensure that you don't use this mode when running the container through the ProcessBuilder API in Java.

-v      Mount the host file system: -v host-file-system:container-file-system

--name  Name of the container.

-w      Working directory for the container.

-u      User name with which you can log into the container.
Creating & Running Docker Containers - Advanced 16

--cpuset=""     CPUs in which to allow execution: 0-3, 0,1

-m              Memory limit for the container. Number & unit (b, k, m, g): 1g = 1GB, 1m = 1MB, 1k = 1KB

--memory-swap   Total memory usage (memory + swap space). Number & unit (b, k, m, g): 1g = 1GB

-e              Set environment variables.

--link=[]       Link another container. For example, when your Tomcat container wants to talk to a MySQL DB container.

--ipc           Inter-process communication.

--dns           Set custom DNS servers.

--dns-search    Set custom DNS search domains.

-h              Container host name: --hostname="voldermort"

--expose=[]     Expose a container port or a range of ports: --expose=8080-8090

--add-host      Add a custom host-to-IP mapping: host:IP
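Combining several of these flags, a typical invocation for a detached Tomcat container might look like this (a sketch using the image names from earlier slides; ports, names, and paths are illustrative, and a running Docker daemon is required):

```shell
# Start Tomcat detached, publish container port 8080 on host port 8081,
# mount a host webapps directory, cap memory at 1 GB, and name it.
docker run -d \
  -p 8081:8080 \
  -v /home/user/webapps:/tomcat/webapps \
  -m 1g \
  --name tomcat8 \
  applifire/tomcat:jre8

# Link an app container to a database container (names are assumed):
docker run -d --name mysqldb mysql
docker run -d --link=mysqldb:db --name app applifire/tomcat:jre8
```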
Docker Container Process Management 17

docker ps options:
-a   Show all the containers. Only running containers are shown by default.
-q   Only display the numeric IDs.
-s   Display the total file sizes.
-f   Provide filters to show containers: -f status=exited, -f exited=100
-l   Show only the latest container.

docker start     Starts a stopped container, for example a Tomcat server.
                 Ex. docker start containerName
docker stop      Stops a container. start and stop are mainly used for detached containers like Tomcat, MySQL, and Apache Web Server containers.
                 Ex. docker stop containerName
docker restart   Restarts a container.
                 Ex. docker restart containerName
Docker Container Management – Short Cuts 18

Remove all exited containers

docker rm containerId/name    Removes an exited container.
docker rm $(docker ps -aq)    docker ps -aq returns the IDs of all containers; docker rm then removes those that have exited (removing a running container fails).
docker stop                   To remove a running container, you need to stop the container first.
                              Ex. Tomcat server running: docker stop containerName

Remove a Docker image

docker rmi imageId            Removes the Docker image.

Remove all Docker images with the <none> tag:
* docker rmi $(docker images | grep "^<none>" | tr -s ' ' | awk -F ' ' '{print $3}')

* This command can be made even better…
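The cleanup pipeline above is just text processing on `docker images` output; you can see what it selects by feeding it a sample of that output (the rows below are made up for illustration, so no Docker daemon is needed):

```shell
# Simulated `docker images` output: a header line plus two image rows.
sample='REPOSITORY          TAG      IMAGE ID
<none>              <none>   4a1b2c3d4e5f
applifire/jdk       7        9f8e7d6c5b4a'

# Keep rows whose repository is <none>, squeeze repeated spaces,
# and print the third column (the image ID) - the IDs docker rmi gets.
echo "$sample" | grep "^<none>" | tr -s ' ' | awk -F ' ' '{print $3}'
# prints: 4a1b2c3d4e5f
```

Note that `tr -s ' '` (squeeze spaces) is what makes the column split reliable; later Docker clients also offer `docker images -q -f dangling=true` for the same purpose.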


Invoking Docker Container using Java ProcessBuilder API 19

When you execute a docker command using the Java ProcessBuilder API, never use run with -it (interactive and terminal). This will block the container from exiting, unless you want to have an interactive session.
Ex. docker run applifire/maven:jdk7 pom.xml
If you are using a shell script to invoke the docker container, then refer to the following to handle Linux and OS X environments.

[Screenshot: shell script with Boot2Docker settings for OS X, using $? to get the exit code of the previous command.]
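The pattern in that script is: run the container non-interactively, then capture `$?` immediately. The sketch below uses `sh -c 'exit 3'` as a stand-in for the real `docker run …` call (which needs a daemon); the Boot2Docker values in the comment are typical defaults, not guaranteed:

```shell
# On OS X, a wrapper script would first point the client at the
# boot2docker VM, e.g.:
#   export DOCKER_HOST=tcp://192.168.59.103:2376

# Stand-in for: docker run applifire/maven:jdk7 pom.xml   (no -it!)
sh -c 'exit 3'

# $? must be read right after the command it refers to.
status=$?
echo "container exited with status $status"
# prints: container exited with status 3
```

The Java side then reads this script's exit code via Process.waitFor().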
LinuX Container
© Realizing Linux Containers (LXC) Building Blocks, Underpinnings & Motivations :
By Boden Russell – IBM Technology Services (brussell@us.ibm.com)
What’s a Linux Container? 21

 Linux Containers (LXC for LinuX Containers) are
 Lightweight virtual machines (VMs)
 Which are realized using features provided by a modern Linux kernel –
 VMs without the hypervisor
 Containerization of:
 (Linux) Operating Systems
 Single or multiple applications (Tomcat, MySQL DB etc.)
Why LXC? 22

“Linux Containers are poised as the next VM in our modern Cloud era…”

Provision in seconds / milliseconds
• Provision time: Days (manual) → Minutes (VM) → Seconds / ms (LXC)

Near bare metal runtime performance

VM-like agility – it’s still “virtualization”

Flexibility
• Containerize a “system”
• Containerize “application(s)”

Lightweight
• Just enough Operating System (JeOS)
• Minimal per container penalty

Open source – free – lower TCO

Supported with OOTB modern Linux kernel

Growing in popularity (Google Trends: LXC, docker)

[Chart: linpack performance @ 45000 – GFlops (0–250) vs. vcpus (1–32 and bare metal).]
Hypervisors vs. Linux containers 23

Containers share the OS kernel of the host and thus are lightweight. However, each container must have the same OS kernel.

Containers are isolated, but share the OS and, where appropriate, libs / bins.

[Diagram: Type 1 Hypervisor (apps / bins & libs / guest OS / VMs / hypervisor / hardware), Type 2 Hypervisor (same stack on top of a host operating system), and Linux Containers (apps / bins & libs / containers / one operating system / hardware).]
LXC Technology Stack 24

LXCs are built on modern kernel features
• cgroups; limits, prioritization, accounting & control
• namespaces; process based resource isolation
• chroot; apparent root FS directory
• Linux Security Modules (LSM); Mandatory Access Control (MAC)

LXC tools
• User space interfaces for kernel functions
• Tools to isolate process(es), virtualizing kernel resources

LXC commoditization
• Dead easy LXC
• LXC virtualization

Orchestration & management
• Scheduling across multiple hosts
• Monitoring
• Uptime
Linux cgroups 25

History
• Work started in 2006 by Google engineers
• Merged into the upstream 2.6.24 kernel due to widespread LXC usage
• A number of features are still a WIP

Functionality
• Access; which devices can be used per cgroup
• Resource limiting; memory, CPU, device accessibility, block I/O, etc.
• Prioritization; who gets more of the CPU, memory, etc.
• Accounting; resource usage per cgroup
• Control; freezing & checkpointing
• Injection; packet tagging

Usage
• cgroup functionality is exposed as “resource controllers” (aka “subsystems”)
• Subsystems are mounted on the FS
• The top-level subsystem mount is the root cgroup; it holds all procs on the host
• Directories under the top-level mounts are created per cgroup
• Procs are put in the tasks file for group assignment
• Interface via read / write of pseudo files in the group
Linux cgroup subsystems 26

cgroups provided via kernel modules


• Not always loaded / provided by default
• Locate and load with modprobe
Some features tied to kernel version
See: https://www.kernel.org/doc/Documentation/cgroups/

Subsystem Tunable Parameters


blkio - Weighted proportional block I/O access. Group wide or per device.
- Per device hard limits on block I/O read/write specified as bytes per second or IOPS per second.
cpu - Time period (microseconds per second) a group should have CPU access.
- Group wide upper limit on CPU time per second.
- Weighted proportional value of relative CPU time for a group.
cpuset - CPUs (cores) the group can access.
- Memory nodes the group can access and migrate ability.
- Memory hardwall, pressure, spread, etc.
devices - Define which devices and access type a group can use.
freezer - Suspend/resume group tasks.
memory - Max memory limits for the group (in bytes).
- Memory swappiness, OOM control, hierarchy, etc..
hugetlb - Limit HugeTLB size usage.
- Per cgroup HugeTLB metrics.
net_cls - Tag network packets with a class ID.
- Use tc to prioritize tagged packets.
net_prio - Weighted proportional priority on egress traffic (per interface).

Linux cgroups FS layout 27

[Diagram: Linux cgroups file system layout.]
Linux cgroups Pseudo FS Interface 28

The Linux pseudo FS is the interface to cgroups
• Read / write to pseudo file(s) in your cgroup directory
• Some libs exist to interface with the pseudo FS programmatically

/sys/fs/cgroup/my-lxc
|-- blkio
|   |-- blkio.io_merged
|   |-- blkio.io_queued
|   |-- blkio.io_service_bytes
|   |-- blkio.io_serviced
|   |-- blkio.io_service_time
|   |-- blkio.io_wait_time
|   |-- blkio.reset_stats
|   |-- blkio.sectors
|   |-- blkio.throttle.io_service_bytes
|   |-- blkio.throttle.io_serviced
|   |-- blkio.throttle.read_bps_device
|   |-- blkio.throttle.read_iops_device
|   |-- blkio.throttle.write_bps_device
|   |-- blkio.throttle.write_iops_device
|   |-- blkio.time
|   |-- blkio.weight
|   |-- blkio.weight_device
|   |-- cgroup.clone_children
|   |-- cgroup.event_control
|   |-- cgroup.procs
|   |-- notify_on_release
|   |-- release_agent
|   `-- tasks
|-- cpu
|   |-- ...
|-- ...
`-- perf_event

Writing a limit:
echo "8:16 1048576" > blkio.throttle.read_bps_device

Reading a setting:
cat blkio.weight_device
dev    weight
8:1    200
8:16   500
Linux cgroups: CPU Usage 29

Use CPU shares (and other controls) to prioritize jobs / containers


Carry out complex scheduling schemes
Segment host resources
Adhere to SLAs

Linux cgroups: CPU Pinning 30

Pin containers / jobs to CPU cores


Carry out complex scheduling schemes
Reduce core switching costs
Adhere to SLAs

Linux cgroups: Device Access 31

Limit device visibility; isolation


Implement device access controls
• Secure sharing
Segment device access
Device whitelist / blacklist

LXC Realization: Linux cgroups 32

cgroup created per container (in each cgroup subsystem)


Prioritization, access, limits per container a la cgroup controls
Per container metrics (bean counters)

Linux namespaces 33

History
• Initial kernel patches in 2.4.19
• Recent 3.8 patches for user namespace support
• A number of features still a WIP
Functionality
• Provide process level isolation of global resources
• MNT (mount points, file systems, etc.)
• PID (process)
• NET (NICs, routing, etc.)
• IPC (System V IPC resources)
• UTS (host & domain name)
• USER (UID + GID)
• Process(es) in namespace have illusion they are the only processes on the
system
• Generally constructs exist to permit “connectivity” with parent namespace
Usage
• Construct namespace(s) of the desired type
• Create process(es) in the namespace (typically done when creating the namespace)
• If necessary, initialize “connectivity” to the parent namespace
• Process(es) in the namespace internally function as if they are the only proc(s) on the system
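These steps can be tried from a shell with util-linux's unshare (a sketch; it assumes a kernel with user namespaces enabled, or root for the privileged variant):

```shell
# Create new user + UTS namespaces; -r maps the caller to root inside.
# The hostname change is visible only inside the new namespace.
unshare -r --uts sh -c 'hostname demohost; hostname'

# Back in the parent namespace, the hostname is unchanged.
hostname
```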

Linux namespaces: Conceptual Overview 34

[Diagram: conceptual overview of Linux namespaces.]
Linux namespaces: MNT namespace 35

Isolates the mount table – per namespace mounts
• mount / unmount operations are isolated to the namespace

Mount propagation
• Shared; mount objects propagate events to one another
• Slave; one mount propagates events to another, but not vice versa
• Private; no event propagation (default)

Unbindable mount forbids bind mounting itself

Various tools / APIs support the mount namespace, such as the mount command
• Options to make shared, private, slave, etc.
• Mount with namespace support

Typically used with chroot or pivot_root for effective root FS isolation

[Diagram: three mount tables – “global” (root) namespace: /, /proc, /mnt/fsrd, /mnt/fsrw, /mnt/cdrom, /run2; “green” namespace: /, /proc, /mnt/greenfs, /mnt/fsrw, /mnt/cdrom; “red” namespace: /, /proc, /mnt/cdrom, /redns.]
Linux namespaces: UTS namespace 36

Per namespace:
• Hostname
• NIS domain name

Reported by commands such as hostname

Processes in a namespace can change UTS values – only reflected in the child namespace

Allows containers to have their own FQDN

[Diagram: “global” (root) namespace: globalhost / rootns.com; “green” namespace: greenhost / greenns.org; “red” namespace: redhost / redns.com.]
Linux namespaces: PID namespace 37

Per namespace PID mapping
• PID 1 in the namespace is not the same as PID 1 in the parent namespace
• No PID conflicts between namespaces
• Effectively 2 PIDs; the PID in the namespace and the PID outside the namespace

Permits migrating namespace processes between hosts while keeping the same PID

Only processes in the namespace are visible within the namespace (visibility limited)

[Diagram: “global” (root) namespace: PID 1 /sbin/init, 2 [kthreadd], 3 [ksoftirqd], 4 [cpuset], 5 /sbin/udevd; “green” namespace: PID 1 /bin/bash, 2 /bin/vim; “red” namespace: PID 1 /bin/bash, 2 python, 3 node.]
Linux namespaces: IPC namespace 38

System V IPC object & POSIX message queue isolation between namespaces
• Semaphores
• Shared memory
• Message queues

Parent namespace connectivity
• Signals
• Memory polling
• Sockets (if no NET namespace)
• Files / file descriptors (if no MNT namespace)
• Events over a pipe pair

[Diagram: per-namespace SHMID / SEMID / MSQID owner tables for the “global” (root), “green”, and “red” namespaces.]
Linux namespaces: NET namespace 39

Per namespace network objects
• Network devices (eths)
• Bridges
• Routing tables
• IP address(es)
• Ports
• Etc.

Various commands support network namespaces, such as ip

Connectivity to other namespaces
• veths – create a veth pair, move one end inside the namespace and configure it
• Acts as a pipe between the 2 namespaces

LXCs can have their own IPs, routes, bridges, etc.

[Diagram: “global” (root) namespace with lo, eth0, eth1, br0 and apps on ports 5000 / 6000 / 7000; “green” namespace with lo, eth0 and apps on ports 1000 / 7000; “red” namespace with lo, eth0 (down), eth1 and apps on ports 7000 / 9000.]
Linux namespaces: USER namespace 40

A long work in progress – still development for XFS and other FS support
• Significant security impacts
• A handful of security holes already found + fixed

Two major features provided:
• Map UID / GID from outside the container to UID / GID inside the container
• Permit non-root users to launch LXCs
• Distros rolling out phased support, with UID / GID mapping typically 1st

First process in the USER namespace has full CAPs; it performs initializations before other processes are created
• No CAPs in the parent namespace

UID / GID map can be pre-configured via FS

Eventually the USER namespace will mitigate many perceived LXC security concerns

[Diagram: UID / GID maps – “global” (root) namespace: root 0:0, ntp 104:109, mysql 105:110, boden 106:111; “green” namespace: root 0:0, app 106:111; “red” namespace: root 0:0, app 104:109.]
LXC Realization: Linux namespaces 41

A set of namespaces created for the container


Container process(es) “executed” in the namespace set
Process(es) in the container have isolated view of resources
Connectivity to parent where needed (via lxc tooling)

Linux namespaces & cgroups: Availability 42

Note: user namespace support


in upstream kernel 3.8+, but
distributions rolling out phased
support:
- Map LXC UID/GID between
container and host
- Non-root LXC creation
Linux chroots 43

Changes the apparent root directory for a process and its children
• Search paths
• Relative directories
• Etc.

A chroot can be escaped given the proper capabilities, thus pivot_root is often used instead
• chroot; points the process's file system root to a new directory
• pivot_root; detaches the new root and attaches it to the process root directory

Often used when building system images
• chroot to a temp directory
• Download and install packages in the chroot
• Compress the chroot as a system root FS

LXC realization
• Bind mount the container root FS (image)
• Launch (unshare or clone) the LXC init process in a new MNT namespace
• pivot_root to the bind mount (root FS)
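The image-building recipe above can be sketched as a shell session (illustrative; debootstrap and root privileges are required, and the suite and paths are assumptions):

```shell
# 1. Populate a temp directory with a minimal Debian root FS.
mkdir -p /tmp/rootfs
debootstrap --variant=minbase stable /tmp/rootfs

# 2. Download and install packages inside the chroot.
chroot /tmp/rootfs apt-get install -y openssh-server

# 3. Compress the tree as a system root FS image.
tar -C /tmp/rootfs -czf rootfs.tar.gz .
```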

Linux chroot vs pivot_root 44

Using pivot_root with MNT namespace addresses escaping


chroot concerns
The pivot_root target directory becomes the “new root FS”

LXC Realization: Images 45

LXC images provide a flexible means to deliver only what you need – lightweight and minimal footprint

Basic constraints
• Same architecture
• Same endian
• Linux’ish Operating System; you can run different Linux distros on same host
Image types
• System; images intended to virtualize Operating System(s) – standard distro root FS less
the kernel
• Application; images intended to virtualize application(s) – only package apps +
dependencies (aka JeOS – Just enough Operating System)
Bind mount host libs / bins into LXC to share host resources
Container image init process
• The container init command is provided on invocation – it can be an application or a full-fledged init process
• Init script customized for the image – skinny SysVinit, upstart, etc.
• Reduces LXC start-up overhead and runtime footprint
Various tools to build images
• SuSE Kiwi
• Debootstrap
• Etc.
LXC tooling options often include numerous image templates

Linux Security Modules & MAC 46

Linux Security Modules (LSM) – kernel modules which provide a framework for
Mandatory Access Control (MAC) security implementations
MAC vs DAC
• In MAC, admin (user or process) assigns access controls to subject / initiator
• Most MAC implementations provide the notion of profiles
• Profiles define access restrictions and are said to “confine” a subject
• In DAC, resource owner (user) assigns access controls to individual resources
Existing LSM implementations include: AppArmor, SELinux, GRSEC, etc.

Linux Capabilities & Other Security Measures 47

Linux capabilities
• Per-process privileges which define operational (sys call) access
• Typically checked based on process EUID and EGID
• Root processes (i.e. EUID = EGID = 0) bypass capability checks

Capabilities can be assigned to LXC processes to restrict what they can do

Other LXC security mitigations
• Reduce shared FS access using RO bind mounts
• Keep the Linux kernel up to date
• User namespaces in 3.8+ kernels
• Allow launching containers as a non-root user
• Map UID / GID inside / outside of the container
LXC Realization 48

[Diagram: LXC realization.]
LXC Tooling 49

LXC is not a kernel feature – it’s a technology enabled via kernel


features
• User space tooling required to manage LXCs effectively
Numerous toolsets exist
• Then: add-on patches to upstream kernel due to slow kernel
acceptance
• Now: upstream LXC feature support is growing – less need for
patches
More popular GNU Linux toolsets include libvirt-lxc and lxc (tools)
• OpenVZ is likely the most mature toolset, but it requires kernel
patches
• Note: I would consider docker a commoditization of LXC
Non-GNU Linux based LXC
• Solaris zones
• BSD jails
• Illumos / SmartOS (solaris derivatives)
• Etc.

LXC Industry Tooling 50

[Diagram: LXC industry tooling.]
Libvirt-lxc 51
• Perhaps the simplest to learn, through a familiar virsh interface
• Libvirt provides LXC support by connecting to lxc:///
• Many virsh commands work:
  • virsh -c lxc:/// define sample.xml
  • virsh -c lxc:/// start sample
  • virsh -c lxc:/// console sample
  • virsh -c lxc:/// shutdown sample
  • virsh -c lxc:/// undefine sample
• No snapshotting, templates…
• OpenStack support since Grizzly
  • No VNC
  • No Cinder support in Grizzly
  • Config drive not supported
  • Alternative means of accessing metadata: attached disk rather than HTTP calls

sample.xml:
  <domain type='lxc'>
    <name>sample</name>
    <memory>32768</memory>
    <os> <type>exe</type> <init>/init</init> </os>
    <vcpu>1</vcpu>
    <clock offset='utc'/>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <devices>
      <emulator>/usr/libexec/libvirt_lxc</emulator>
      <filesystem type='mount'> <source dir='/opt/vm-1-root'/> <target dir='/'/> </filesystem>
      <interface type='network'> <source network='default'/> </interface>
      <console type='pty'/>
    </devices>
  </domain>
LXC (tools) 52
• A little more functionality
• Supported by the major distributions
• LXC 1.0 recently released
  • Cloning supported: lxc-clone
  • Templates… btrfs
  • lxc-create -t ubuntu -n CN creates a new ubuntu container
    • The "template" is downloaded from Ubuntu
    • Some support for Fedora <= 14
    • Debian is supported
  • lxc-start -d -n CN starts the container
  • lxc-destroy -n CN destroys the container
  • /etc/lxc/lxc.conf holds the default settings
  • /var/lib/lxc/CN is the default location for each container
LXC Commoditization: docker 53
• Young project with great vibrancy in the industry
• Currently based on unmodified LXC – but the goal is to make it dirt easy
• As of March 10th, 2014, at v0.9; monthly releases, and 1.0 should be ready for production use
• What docker adds to LXC:
  • Portable deployment across machines
    • In Cloud terms, think of LXC as the hypervisor and docker as the Open Virtualization Appliance (OVA) and the provisioning engine
    • Docker images can run unchanged on any platform supporting docker
  • Application-centric
    • User-facing function geared towards application deployment, not VM analogs
  • Automatic build
    • Create containers from build files
    • Builders can use chef, maven, puppet, etc.
  • Versioning support
    • Think of git for docker containers
    • Only the delta from the base container is tracked
  • Component re-use
    • Any container can be used as a base, specialized, and saved
  • Sharing
    • Support for public/private repositories of containers
  • Tools
    • CLI / REST API for interacting with docker
    • Vendors adding tools daily
• Docker containers are self-contained – no more "dependency hell"
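The "automatic build" and "sharing" points above can be sketched with a build file (Dockerfile). This is a hedged, v0.9-era illustration: the image name myrepo/web and the apache2 payload are made up for the example, and the docker commands are shown as comments since they need a running docker daemon:

```shell
# Hypothetical build file for the application-centric workflow described above
cat > Dockerfile <<'EOF'
FROM ubuntu
RUN apt-get update && apt-get install -y apache2
EXPOSE 80
CMD ["apachectl", "-D", "FOREGROUND"]
EOF
# With a docker daemon available, the lifecycle would be roughly:
#   docker build -t myrepo/web:1.0 .     # automatic build from the build file
#   docker run -d -p 8080:80 myrepo/web:1.0
#   docker push myrepo/web:1.0           # share via a public/private registry
head -1 Dockerfile
```

Each build step becomes a layer, which is what makes the "only the delta of the base container is tracked" versioning model work.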
Docker vs. LXC vs. Hypervisor 54
Docker: LXC Virtualization? 55
• Docker decouples the LXC provider from the operations
  • LXC provider agnostic
• Docker "images" run anywhere docker is supported
  • Portability
LXC Orchestration & Management 56
• Docker & libvirt-lxc in OpenStack
  • Manage containers heterogeneously with traditional VMs… but not with the level of support & features we might like
• CoreOS
  • Zero-touch admin Linux distro with docker images as the unit of operation
  • Centralized key/value store to coordinate the distributed environment
• Various other 3rd-party apps
  • Maestro for docker
  • Shipyard for docker
  • Fleet for CoreOS
  • Etc.
• LXC migration
  • Container migration via CRIU
• But…
  • Still no great way to tie all virtual resources together with LXC – e.g. storage + networking
  • IMO, an area which needs focus for LXC to become more generally applicable
Docker in OpenStack 57
• Introduced in Havana
  • A nova driver to integrate with the docker REST API
  • A Glance translator to integrate containers with Glance
  • A docker container which implements a docker registry API
• The claim is that docker will become a "group A" hypervisor
  • In its current form it is effectively a "tech preview"
LXC Evaluation 58
• Goal: validate the promise with an eye towards practical applicability
• Dimensions evaluated:
  • Runtime performance benefits
  • Density / footprint
  • Workload isolation
  • Ease of use and tooling
  • Cloud integration
  • Security
  • Ease of use / feature set
• NOTE: tests were performed in a passive manner – deeper analysis is warranted.
Runtime Performance Benefits - CPU 59
• Tested using libvirt-lxc on Ubuntu 13.10 using Linpack 11.1
• cpuset was used to limit the number of CPUs the containers could use
• The performance overhead falls within the error of measurement of this test
• Actual bare-metal performance is actually lower than some container results

[Chart: Linpack performance @ 45000 – GFlops vs. number of vcpus (1–31) plus bare metal (BM); containers reached 220.9 and 220.77 GFlops @ 31 vcpus vs. 220.5 GFlops on bare metal @ 32 vcpus]
Runtime Performance Benefits – I/O 60
• I/O tests using libvirt-lxc show a < 1% degradation
• Tested with a pass-through mount

[Chart: I/O throughput (MB/s) – lxc write 1711.2 vs. bare-metal write 1724.9; lxc read 1626.4 vs. bare-metal read 1633.4]

Sync write I/O test: rw=write, size=1024m, bs=128mb, direct=1, sync=1
Sync read I/O test: rw=read, size=1024m, bs=128mb, direct=1, sync=1
Runtime Performance Benefits – Block I/O 61
• Tested with [standard] AUFS
Density & Footprint – libvirt-lxc 62
Using libvirt-lxc on RHEL 6.4, we found that empty container overhead was just 840 bytes. A container could be started in about 330 ms, which was an I/O-bound process.
• This represents the lower limit of LXC footprint
• Containers ran /bin/sh

• Starting 500 containers:
  Mon Nov 11 13:38:49 CST 2013 ... all threads done in 157 (sequential, I/O bound)
• Stopping 500 containers:
  Mon Nov 11 13:42:20 CST 2013 ... all threads done in 162
  Active memory delta: 417.2 KB
• Starting 1000 containers:
  Mon Nov 11 13:59:19 CST 2013 ... all threads done in 335
• Stopping 1000 containers:
  Mon Nov 11 14:14:26 CST 2013 ... all threads done in 339
  Active memory delta: 838.4 KB
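The serial start/stop timings above can be reproduced with a simple wall-clock loop. A sketch of the methodology, where start_container is a stub standing in for the real lxc-start call and N is kept small for illustration:

```shell
# Sketch of the serial start-time measurement (stubbed; assumes nothing
# about the real environment).
start_container() { sleep 0.01; }   # stand-in for: lxc-start -d -n "CN$1"
N=10                                # the test above used 500 and 1000
t0=$(date +%s)
for i in $(seq 1 "$N"); do start_container "$i"; done
t1=$(date +%s)
echo "all $N containers started in $((t1 - t0))s"
```

Active-memory deltas per batch would be read from /proc/meminfo before and after the loop, which is how the 417.2 KB / 838.4 KB figures above were obtained.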
Density & Footprint – Docker 63
• In this test, we created 150 Docker containers with CentOS, started apache & then removed them
• Average footprint was ~10 MB per container
• Average start time was 240 ms

• Serially booting 150 containers which run apache:
  • Takes on average 36 seconds
  • Consumes about 2% of the CPU
  • Negligible HDD space
  • Spawns around 225 processes per create
  • Around 1.5 GB of memory, ~10 MB per container
  • Expect faster results once docker addresses performance topics in the next few months
• Serially destroying 150 containers running apache:
  • On average takes 9 seconds
  • We would expect destroy to be faster – likely a docker bug; we will triage with the docker community

[Charts: CPU profile and I/O profile during container creation / deletion]
Workload Isolation: Examples 64
• Using the blkio cgroup (lxc.cgroup.blkio.throttle.read_bps_device) to cap the I/O of a container
  • Both the total bps and iops_device on read / write can be capped
  • Better async BIO support in kernel 3.10+
• We used fio with oflag=sync, direct to test the ability to cap reads:
  • With the limit set to 6 MB / second:
    READ: io=131072KB, aggrb=6147KB/s, minb=6295KB/s, maxb=6295KB/s, mint=21320msec, maxt=21320msec
  • With the limit set to 60 MB / second:
    READ: io=131072KB, aggrb=61134KB/s, minb=62601KB/s, maxb=62601KB/s, mint=2144msec, maxt=2144msec
  • With no read limit:
    READ: io=131072KB, aggrb=84726KB/s, minb=86760KB/s, maxb=86760KB/s, mint=1547msec, maxt=1547msec
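The 6 MB/s cap above is just a bytes-per-second value written to the cgroup's throttle file. A sketch, where the 8:0 major:minor device number and the cgroup path are illustrative, and the privileged write is left commented out:

```shell
# Compute the byte/sec limit used in the first fio run above
LIMIT_BPS=$((6 * 1024 * 1024))   # 6 MB/s -> bytes per second
echo "read cap: $LIMIT_BPS bytes/sec"
# As root, against an illustrative device 8:0 and container cgroup:
#   echo "8:0 $LIMIT_BPS" > /sys/fs/cgroup/blkio/lxc/CN/blkio.throttle.read_bps_device
# Then drive reads with fio, roughly matching the test parameters above:
#   fio --name=capped --rw=read --size=1024m --bs=128m --direct=1 --sync=1
```

The aggrb=6147KB/s result above lines up with this limit: 6291456 bytes/sec is ~6144 KB/s.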
OpenStack VM Operations 65
NOTE: orchestration / management overheads cap LXC performance
Who’s Using LXC 66
• Google App Engine & infrastructure is said to be using some form of LXC
• Red Hat OpenShift
• dotCloud (now Docker Inc.)
• CloudFoundry (early versions)
• Rackspace Cloud Databases
  • Outperforms AWS (Xen) according to perf results
• Parallels Virtuozzo (commercial product)
• Etc.
LXC Gaps 67
There are gaps…
• Lack of industry tooling / support
• Live migration still a WIP
• Full orchestration across resources (compute / storage / networking)
• Fears about security
• Not a well-known technology… yet
• Integration with existing virtualization and Cloud tooling
• Not much / any industry standards
• Missing skillset
• Slower upstream support due to the kernel dev process
• Etc.
LXC: Use Cases For Traditional VMs 68
There are still use cases where traditional VMs are warranted:
• Virtualization of non-Linux-based OSs
  • Windows
  • AIX
  • Etc.
• LXC not supported on the host
• VM requires a unique kernel setup which is not applicable to other VMs on the host (i.e. per-VM kernel config)
• Etc.
LXC Recommendations 69
• Public & private Clouds
  • Increase VM density 2-3x
  • Accommodate Big Data & HPC type applications
  • Move the support of Linux distros to containers
• PaaS & managed services
  • Realize "as a Service" and managed services using LXC
• Operations management
  • Ease management + increase agility of bare-metal components
• DevOps
• Development & test
  • Sandboxes
  • Dev / test envs
  • Etc.
• If you are just starting with LXC and don't have an in-depth skillset:
  • Start with LXC for private solutions (trusted code)
LXC Resources 70
https://www.kernel.org/doc/Documentation/cgroups/
http://www.blaess.fr/christophe/2012/01/07/linux-3-2-cfs-cpu-bandwidth-english-version/
http://atmail.com/kb/2009/throttling-bandwidth/
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch-Subsystems_and_Tunable_Parameters.html
http://www.janoszen.com/2013/02/06/limiting-linux-processes-cgroups-explained/
http://www.mattfischer.com/blog/?p=399
http://oakbytes.wordpress.com/2012/09/02/cgroup-cpu-allocation-cpu-shares-examples/
http://fritshoogland.wordpress.com/2012/12/15/throttling-io-with-linux/
https://lwn.net/Articles/531114/
https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
http://www.ibm.com/developerworks/library/l-mount-namespaces/
http://blog.endpoint.com/2012/01/linux-unshare-m-for-per-process-private.html
http://timothysc.github.io/blog/2013/02/22/perprocess/
http://www.evolware.org/?p=293
http://s3hh.wordpress.com/2012/05/10/user-namespaces-available-to-play/
http://libvirt.org/drvlxc.html
https://help.ubuntu.com/lts/serverguide/lxc.html
https://linuxcontainers.org/
https://wiki.ubuntu.com/AppArmor
http://linux.die.net/man/7/capabilities
http://docs.openstack.org/trunk/config-reference/content/lxc.html
https://wiki.openstack.org/wiki/Docker
https://www.docker.io/
http://marceloneves.org/papers/pdp2013-containers.pdf
http://openvz.org/Main_Page
http://criu.org/Main_Page
Thank You….. 71

Araf Karsh Hamid
(araf.karsh@gmail.com)

Thanks to:
Boden Russell – IBM Technology Services (brussell@us.ibm.com)
for the fantastic presentation on LinuX Containers