Introduction

For a few weeks I have been working on my pet project: creating a production-ready Kubernetes cluster that runs in an IPv6-only environment.

As the complexity and challenges of this project are rather interesting, I decided to start documenting them in this blog post.

The ungleich-k8s repository contains all snippets and the latest code.

Objective

The Kubernetes cluster should support the following workloads:

  • Matrix Chat instances (Synapse+postgres+nginx+element)
  • Virtual Machines (via kubevirt)
  • Storage for internal and external consumers (via Ceph)

Components

The following is a list of the components that I am using so far. This might change along the way, but I wanted to note down already what I selected and why.

OS: Alpine Linux

The operating system of choice for running the k8s cluster is Alpine Linux, as it is small, stable and supports both docker and cri-o.

Container management: docker

Originally I started with cri-o. However, using cri-o together with kubevirt and calico results in an overlayfs being mounted over / of the host, which breaks most of the host's functionality (see below for details).

Docker, while deprecated as a container runtime in Kubernetes, at least allows me to get kubevirt running, generally speaking.

Networking: IPv6 only, calico

I wanted to go with cilium first, because it goes down the eBPF route from the get-go. However, cilium does not yet offer native, automated BGP peering with the upstream infrastructure, so managing node / IP network peering becomes a tedious, manual and error-prone task. Cilium is on its way to improving this, but is not there yet.

Calico, on the other hand, still relies on ip(6)tables and kube-proxy for forwarding traffic, but has had proper BGP support for a long time. Calico also aims to add eBPF support; however, at the moment its eBPF data plane does not support IPv6 yet (bummer!).
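
For illustration, the BGP side of calico boils down to two small resources. The following is a sketch with placeholder peer values, not a copy of the actual configuration (which lives in the ungleich-k8s repository); the node AS matches the AS65534 that shows up in the bird output further down:

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  asNumber: 65534                   # AS of the k8s nodes, as seen by the upstream bird
  nodeToNodeMeshEnabled: true
  serviceClusterIPs:
    - cidr: 2a0a:e5c0:13:e2::/108   # advertise the service CIDR (the /108 visible in bird)
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: upstream-router
spec:
  peerIP: 2a0a:e5c0:13::1           # placeholder: address of the upstream router
  asNumber: 65533                   # placeholder: AS of the upstream infrastructure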

Storage: rook

Rook seems to be the first choice if you look around at who is providing storage in the k8s world. It looks rather solid, even though some of its knobs are not yet clear to me.

Rook, in my opinion, is a direct alternative to running cephadm, which requires systemd on your hosts. Given Alpine Linux, that will never be the case.
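
For reference, bringing up Rook essentially means applying a handful of manifests from the upstream examples (file names as in the rook repository; the variants actually used here are tracked in ungleich-k8s):

kubectl apply -f crds.yaml -f common.yaml -f operator.yaml
kubectl apply -f cluster.yaml   # the ceph cluster definition referenced throughout this post
kubectl apply -f toolbox.yaml   # provides the rook-ceph-tools deployment used for ceph -s later on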

Virtualisation

Kubevirt seems to provide a good interface. Mid-term, kubevirt is projected to replace OpenNebula at ungleich.

Challenges

cri-o + calico + kubevirt = broken host

So this is a rather funky one. If you deploy cri-o and calico, everything works. If you then deploy kubevirt, the virt-handler pod fails to come up with the following error message:

 Error: path "/var/run/kubevirt" is mounted on "/" but it is not a shared mount.

On the Internet there are two recommendations for fixing this:

  • Fix the systemd unit for docker: since this setup uses neither systemd nor docker, this is obviously not applicable...
  • Issue mount --make-shared /

The second command has a very strange side effect: after issuing it, the contents of a calico pod are mounted as an overlayfs over / of the host. This shadows /proc, so things like ps, mount and co. fail, and basically the whole system becomes unusable until a reboot.

This is fully reproducible. I first suspected the tmpfs on / to be the issue, so I used some disks instead of booting over the network to check it; even a regular ext4 on / causes the exact same problem.
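
For checking what actually happened, findmnt can print the propagation mode of a mount point before and after such experiments (plain util-linux, nothing specific to this setup):

findmnt -o TARGET,PROPAGATION /
findmnt -o TARGET,PROPAGATION /sys
findmnt -o TARGET,PROPAGATION /run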

docker + calico + kubevirt = other shared mounts

Now, given that cri-o + calico + kubevirt does not lead to the expected result, what does the same setup look like with docker? With docker, the calico node pods fail to come up if /sys is not a shared mount, and the virt-handler pods fail if /run is not a shared mount.

Issuing the following commands makes both work:

mount --make-shared /sys
mount --make-shared /run

Two funky findings here: first, the paths that need to be shared are completely different between docker and cri-o, even though the mapped host paths in the pod descriptions are the same. Second, why is /sys not being shared a problem for calico under docker, but not under cri-o?
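
To make these mounts shared again after a reboot on Alpine, one option is a small local.d start script. This is an assumption about a workable approach (it requires the OpenRC local service, i.e. rc-update add local default), not something taken from the repository:

#!/bin/sh
# /etc/local.d/shared-mounts.start -- make executable with chmod +x
# Make /sys and /run shared so the calico-node and virt-handler pods can start.
mount --make-shared /sys
mount --make-shared /run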

Log

Status 2021-06-07

Today I have updated the ceph cluster definition in rook to

  • check hosts for new disks every 10 minutes instead of every 60 minutes
  • use IPv6 instead of IPv4 (sketched below)
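
In Rook terms these two changes correspond roughly to the operator's device discovery interval and the cluster's IP family. The snippets below are my reconstruction of the relevant knobs; the authoritative files are in the ungleich-k8s repository:

# operator.yaml (excerpt): rescan hosts for new devices every 10 minutes instead of the 60m default
- name: ROOK_DISCOVER_DEVICES_INTERVAL
  value: "10m"

# cluster.yaml (excerpt): run the ceph daemons on IPv6
spec:
  network:
    ipFamily: "IPv6"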

The successful ceph -s output:

[20:42] server47.place7:~/ungleich-k8s/rook# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
  cluster:
    id:     049110d9-9368-4750-b3d3-6ca9a80553d7
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum a,b,d (age 75m)
    mgr: a(active, since 74m), standbys: b
    osd: 6 osds: 6 up (since 43m), 6 in (since 44m)

  data:
    pools:   2 pools, 33 pgs
    objects: 6 objects, 34 B
    usage:   37 MiB used, 45 GiB / 45 GiB avail
    pgs:     33 active+clean

The result is a working ceph cluster with RBD support. I also applied the cephfs manifest; however, RWX (ReadWriteMany) volumes are not yet spinning up. It seems that test helm charts often require RWX instead of RWO (ReadWriteOnce) access.
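
For illustration, this is the kind of claim that is currently not coming up: a minimal RWX PersistentVolumeClaim against a CephFS-backed storage class (the class name rook-cephfs is the upstream example name and an assumption here, not verified against the repo):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-test
spec:
  accessModes:
    - ReadWriteMany                # RWX needs CephFS rather than RBD
  storageClassName: rook-cephfs    # assumption: name from the upstream rook cephfs example
  resources:
    requests:
      storage: 1Gi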

Also the ceph dashboard does not come up, even though it is configured:

[20:44] server47.place7:~# kubectl -n rook-ceph get svc
NAME                       TYPE        CLUSTER-IP              EXTERNAL-IP   PORT(S)             AGE
csi-cephfsplugin-metrics   ClusterIP   2a0a:e5c0:13:e2::760b   <none>        8080/TCP,8081/TCP   82m
csi-rbdplugin-metrics      ClusterIP   2a0a:e5c0:13:e2::482d   <none>        8080/TCP,8081/TCP   82m
rook-ceph-mgr              ClusterIP   2a0a:e5c0:13:e2::6ab9   <none>        9283/TCP            77m
rook-ceph-mgr-dashboard    ClusterIP   2a0a:e5c0:13:e2::5a14   <none>        7000/TCP            77m
rook-ceph-mon-a            ClusterIP   2a0a:e5c0:13:e2::c39e   <none>        6789/TCP,3300/TCP   83m
rook-ceph-mon-b            ClusterIP   2a0a:e5c0:13:e2::732a   <none>        6789/TCP,3300/TCP   81m
rook-ceph-mon-d            ClusterIP   2a0a:e5c0:13:e2::c658   <none>        6789/TCP,3300/TCP   76m
[20:44] server47.place7:~# curl http://[2a0a:e5c0:13:e2::5a14]:7000
curl: (7) Failed to connect to 2a0a:e5c0:13:e2::5a14 port 7000: Connection refused
[20:45] server47.place7:~#
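
One way to check from the toolbox whether the dashboard module is actually enabled and where the mgr thinks it is listening (standard ceph commands, nothing rook specific):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr module ls
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr services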

The ceph mgr is perfectly reachable though:

[20:45] server47.place7:~# curl -s http://[2a0a:e5c0:13:e2::6ab9]:9283/metrics | head

# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
# HELP ceph_mon_quorum_status Monitors in quorum
# TYPE ceph_mon_quorum_status gauge
ceph_mon_quorum_status{ceph_daemon="mon.a"} 1.0
ceph_mon_quorum_status{ceph_daemon="mon.b"} 1.0
ceph_mon_quorum_status{ceph_daemon="mon.d"} 1.0
# HELP ceph_fs_metadata FS Metadata

Status 2021-06-06

Today is the first day of publishing these findings, so this blog article is still missing quite some information. If you are curious and want to know more than what is published yet, you can find me on Matrix in the #hacking:ungleich.ch room.

What works so far

  • Spawning IPv6 only pods works
  • Spawning IPv6 only services works (see the Service sketch after this list)
  • BGP peering and ECMP routes with the upstream infrastructure work
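
For reference, this is what an IPv6 only Service looks like in manifest form, using the upstream dual-stack fields available in recent k8s releases; a generic sketch, not copied from the repository:

apiVersion: v1
kind: Service
metadata:
  name: example
spec:
  ipFamilyPolicy: SingleStack   # refuse to fall back to IPv4
  ipFamilies:
    - IPv6
  selector:
    app: example
  ports:
    - port: 80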

Here's the output of the upstream bird process showing the routes learned from k8s:

bird> show route
Table master6:
2a0a:e5c0:13:e2::/108 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                     unicast [place7-server3 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                     unicast [place7-server4 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                     unicast [place7-server2 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
2a0a:e5c0:13:e1:176b:eaa6:6d47:1c40/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                     unicast [place7-server4 23:45:21.591] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                     unicast [place7-server3 23:45:21.591] (100) [AS65534i]
        via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                     unicast [place7-server2 23:45:21.589] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
2a0a:e5c0:13:e1:e0d1:d390:343e:8480/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                     unicast [place7-server3 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                     unicast [place7-server4 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                     unicast [place7-server2 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
2a0a:e5c0:13::/48    unreachable [v6 2021-05-16] * (200)
2a0a:e5c0:13:e1:9b19:7142:bebb:4d80/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                     unicast [place7-server3 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                     unicast [place7-server4 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                     unicast [place7-server2 2021-06-05] (100) [AS65534i]
        via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
bird>

What doesn't work

  • Rook does not format/spin up all disks
  • Deleting all rook components fails (kubectl delete -f cluster.yaml hangs forever)
  • Spawning VMs fails with the error unable to recognize "vmi.yaml": no matches for kind "VirtualMachineInstance" in version "kubevirt.io/v1" (see the version check below)
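
The last error usually means the manifest requests an API version that the installed kubevirt CRDs do not serve (older kubevirt releases only served kubevirt.io/v1alpha3). Two quick ways to check which versions are actually available, using plain kubectl:

kubectl api-resources --api-group=kubevirt.io
kubectl get crd virtualmachineinstances.kubevirt.io -o jsonpath='{.spec.versions[*].name}'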