DIY Docker
docker and others (lxc, podman...) are programs that allow the creation of lightweight virtualization. as opposed to real virtualization (a type 1 hypervisor), it's software based, implemented by the kernel through a feature called namespaces in linux. a namespace makes the process think it's on a separate machine with different settings (mount points, time, user ids, ipc, pids, network interfaces, cpu count/memory amount, hostname), and the process can manipulate its environment however it likes (with proper capabilities of course), like a real root.
1. Intro to Namespaces
a namespace is a wrapper around a set of system resources, and each namespace type controls a different kind of resource. in linux the types of namespaces are (a quick way to inspect them follows this list):
a. User
this namespace isolates user and group ids, so that a process can have different user/group ids inside and outside the namespace.
b. UTS
uts namespace controls the hostname.
c. Time
time namespace controls the monotonic and boot-time clocks.
d. Mount
mount namespace controls mount points so that processes can have a different view of the filesystem tree.
e. IPC
controls POSIX and System V ipc.
f. CGroup
virtualizes the view of the cgroup hierarchy, which is what limits resources like memory and cpu for the processes within the namespace.
g. Network
controls network interfaces and ports.
h. PID
creates a new pid tree for the processes within the namespace.
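every namespace a process belongs to is exposed as a symlink under /proc/<pid>/ns, and two processes share a namespace exactly when those links point to the same object. a quick way to list them (the inode numbers below are illustrative; yours will differ):
readlink /proc/self/ns/*
cgroup:[4026531835]
ipc:[4026531839]
mnt:[4026531841]
net:[4026531840]
pid:[4026531836]
time:[4026531834]
user:[4026531837]
uts:[4026531838]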
2. Build Root Image
before creating the namespaces we first need a root filesystem where our processes will live, which will become their root. we will use buildroot to build a simple image:
git clone --depth 1 git://git.buildroot.net/buildroot buildroot
(cd buildroot; make defconfig; make -j`nproc`)
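when the build finishes, the root filesystem we will jail our processes into sits under output/target (listing abridged):
ls buildroot/output/target
bin  dev  etc  lib  ...  sbin  usr  var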
3. Setup Namespaces & Exec Init
none of the operations that follow requires root privileges.
to set up the new namespaces in C:
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <unistd.h>

// CLONE_NEWTIME needs linux 5.6+; older libcs may not define it
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

int main(int argc, char **argv, char **envp)
{
	unshare(CLONE_NEWTIME
	        |CLONE_NEWNS
	        |CLONE_NEWCGROUP
	        |CLONE_NEWUTS
	        |CLONE_NEWIPC
	        |CLONE_NEWUSER
	        |CLONE_NEWPID
	        |CLONE_NEWNET);
	// init boot time: timens_offsets is writable only while the new
	// time namespace has no member processes, so this must happen
	// before fork(2)
	int fd = open("/proc/self/timens_offsets", O_WRONLY);
	// 7 is CLOCK_BOOTTIME; replace 12685 with time found in /proc/uptime
	write(fd, "7 -12685 0", 10);
	close(fd);
	// map user id 1000 to 0 (root)
	fd = open("/proc/self/uid_map", O_WRONLY);
	write(fd, "0 1000 1", 8);
	close(fd);
	// disabling setgroups(2) is required before an unprivileged
	// process may write gid_map
	fd = open("/proc/self/setgroups", O_WRONLY);
	write(fd, "deny", 4);
	close(fd);
	// map group id 1000 to 0 (root)
	fd = open("/proc/self/gid_map", O_WRONLY);
	write(fd, "0 1000 1", 8);
	close(fd);
	// fork so the child becomes pid 1 of the new pid namespace,
	// and exit the parent
	if (fork() != 0)
		exit(0);
	// chroot into the buildroot filesystem
	chroot("buildroot/output/target/");
	chdir("/");
	mount("proc", "/proc", "proc",
	      MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL);
	char *sh[] = { "/bin/sh", NULL };
	execve("/bin/sh", sh, envp);
	return 1;
}
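assuming the snippet is saved as diy.c (the name is illustrative), compile and run it as a normal user whose uid/gid is 1000:
cc diy.c -o diy
./diy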
the most important part is the unshare(2) call, which informs the kernel to create new namespaces for the current process. this may seem like a lot of work, but there is a tool that will effectively do the same; it's a wrapper around unshare(2):
unshare --fork --user --mount --uts --ipc \
--net --pid --cgroup --time \
--map-root-user \
--root=buildroot/output/target/ \
--boottime=-`cat /proc/uptime \
|cut -d' ' -f1 \
|cut -d'.' -f1` \
--mount-proc /bin/sh
either way you should end up in a shell running inside the container. the process is bound to the given resources and won't be able to escape the jail.
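a few quick checks from inside the new shell confirm the isolation (the outputs follow from the setup above):
id -u          # prints 0: uid 1000 was mapped to root
hostname diy   # allowed: the uts namespace is ours to change
ps             # /bin/sh shows up as pid 1 of the new pid tree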
4. Networking
it would be nice if our "container" could connect to the outside world, so let's set up some basic networking:
a. virtual interface
as mentioned above, having a separate net namespace means having different network devices, ips, and so on. so we need to create a virtual interface pair that will link our container with the host machine, which will act as a router:
- host:
doas ip link add veth0 type veth peer name veth1
the host will have veth0 and the container will have veth1. now we need to move veth1 to the container, but before we do that we have to register our net namespace with the ip tool:
doas ip netns attach diy-container <pid of a process within the container>
move veth1 to the container:
doas ip link set veth1 netns diy-container
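veth1 should now have disappeared from the host's device list; since the namespace is registered, we can still reach it from the host through ip netns (an optional sanity check):
doas ip netns exec diy-container ip link show veth1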
from here there are two possible solutions: the first is to bridge the container interface with the hardware interface; the second is to use network address translation and route packets through the host.
Solution #1: BRIDGE
this is the easier of the two to configure and the recommended approach for its transparency:
b. setup a bridge
we will create a bridge that will link the physical interface with the virtual interface of the container. first take note of the current network status:
- host:
# save the ip address:
# replace enp0s1 with the interface connected to internet
ip addr show enp0s1
# save the gateway:
ip route show dev enp0s1
For example, this is the relevant info:
IP address attached to enp0s1: 10.2.3.4/24
Default gateway: 10.2.3.1
now let's create the bridge and configure it.
doas ip link add br0 type bridge
doas ip link set dev br0 up
doas ip address add 10.2.3.4/24 dev br0
link the physical interface and the veth with the bridge, and move the address and default route over to it:
doas ip link set enp0s1 master br0
doas ip address del 10.2.3.4/24 dev enp0s1
doas ip route replace default via 10.2.3.1 dev br0
doas ip link set veth0 up
doas ip link set veth0 master br0
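both ports should now list br0 as their master (a quick check before configuring the container):
ip link show master br0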
now the container has physical access to the network just like the host; it just needs to be configured:
- container:
# replace 10.2.3.5 with an available ip
ip link set veth1 up
ip address add 10.2.3.5/24 dev veth1
ip route add default via 10.2.3.1 dev veth1
now we have full access to the physical network:
ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=118 time=201.878 ms
64 bytes from 8.8.8.8: seq=1 ttl=118 time=214.774 ms
64 bytes from 8.8.8.8: seq=2 ttl=118 time=112.869 ms
Solution #2: NAT
b. connect the host and the container
now if we check the list of devices inside the container we should get:
- container:
ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: veth1@if5: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state DOWN qlen 1000
link/ether da:2b:31:d0:b0:39 brd ff:ff:ff:ff:ff:ff
set up the address and bring the interface up:
ip addr add 10.0.0.2/24 dev veth1
ip link set veth1 up
we should do the same on the host:
- host:
doas ip addr add 10.0.0.1/24 dev veth0
doas ip link set veth0 up
now we have a connection; we can ping both ways:
ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.057 ms
- container:
ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.057 ms
c. route container packets
now we have to allow container packets to go through the host to the network:
- host:
allow ip forwarding:
echo 1 | doas tee /proc/sys/net/ipv4/ip_forward
add NAT:
# replace enp0s1 with net interface connected to the network
doas iptables -t nat -A POSTROUTING -o enp0s1 -j MASQUERADE
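if the host's FORWARD chain policy is DROP (the default on some setups, e.g. when docker is installed), forwarding for the veth has to be allowed explicitly; skip this if your policy is ACCEPT:
doas iptables -A FORWARD -i veth0 -o enp0s1 -j ACCEPT
doas iptables -A FORWARD -i enp0s1 -o veth0 -m state --state ESTABLISHED,RELATED -j ACCEPT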
- container:
finally set the default gateway of the container:
ip route add default via 10.0.0.1 dev veth1
now we can reach the network from the container:
ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=118 time=201.878 ms
64 bytes from 8.8.8.8: seq=1 ttl=118 time=214.774 ms
64 bytes from 8.8.8.8: seq=2 ttl=118 time=112.869 ms
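one last note: pinging 8.8.8.8 proves connectivity, but name resolution depends on the container's own /etc/resolv.conf (the one inside the buildroot image), so point it at a reachable resolver if you need dns:
echo 'nameserver 8.8.8.8' > /etc/resolv.conf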