Setting up single-node Ceph cluster for Kubernetes
After setting up my shiny single-node Kubernetes "cluster" I wanted to do something useful with it. Many of the useful things require your useful data not to disappear, so I needed to figure out how to do that.
In Kubernetes, storage is organized via Volumes. There are two types of Volumes: regular ones and Persistent Volumes. Regular Volumes are ephemeral and are destroyed together with their Pod (e.g. when it crashes or is rescheduled). They are of course less interesting than Persistent Volumes, which, as the name suggests, survive Pod restarts and are attached to Pods via Persistent Volume Claims.
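To make the distinction concrete, here is a minimal sketch of a regular, ephemeral Volume (an emptyDir); the Pod name and image are placeholders, not something used later in this post:

apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo            # hypothetical example Pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}                # lives and dies with the Pod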
There are many ways to implement Persistent Volumes; the simplest is probably Local Persistent Volumes, which simply bind a local directory into the Pod. However, they force the Pod to always run on the same node.
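For comparison, a Local Persistent Volume looks roughly like this (the path and node name are hypothetical); note the mandatory nodeAffinity that pins every consumer to that one node:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example        # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1       # pre-created directory on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - my-home-server      # the one node the data lives on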
This was not interesting enough for me so I went with something more complicated.
Choosing storage provider
As with many things in Kubernetes, there is a wide choice of Persistent Volume implementations.
When running on top of a public cloud the choice is fairly obvious: it's usually best to use the cloud's virtual block devices, such as gcePersistentDisk or awsElasticBlockStore.
I only had my poor server-laptop, the potential to add more hosts in the future, and no virtual block device provider. So I had to build my own virtual block devices or filesystems.
The options here are Gluster, EdgeFS and Ceph. The latter seemed the most interesting to me (and the most familiar - I used to be an SRE for a very similar system).
Planning Ceph installation
I had a single node with 3 different storage devices to play with. The node is just an old laptop that I use as a home server. It has:
- a tiny 20G SSD, where the system is installed, mounted as /
- a larger (but slower) HDD
- an even larger (and even slower) USB-HDD
By default Ceph requires at least 3 different nodes, each with at least one storage device. However, it is possible to configure Ceph's CRUSH map to accept a single node.
Thus I decided to use replicated pools with a min_size of 2: each piece of data will be stored at least twice, on two different devices. This way I won't lose any data if one of the storage devices fails. Of course, I'd still lose everything if the whole laptop dies, but that is OK for a home system, and I may add more hardware in the future.
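Rook derives the equivalent settings from the manifests later in this post, but expressed directly in Ceph terms the idea is roughly the following (the rule and pool names are illustrative):

# Replicate across OSDs (devices) rather than across hosts
ceph osd crush rule create-replicated replicated_osd default osd
# Keep 2 copies of every object and refuse I/O if fewer than 2 are available
ceph osd pool set mypool crush_rule replicated_osd
ceph osd pool set mypool size 2
ceph osd pool set mypool min_size 2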
Deploying Ceph
There seem to be two ways of deploying Ceph:
- Manually outside of Kubernetes (installing via OS packages)
- On top of Kubernetes.
Setting up Ceph manually seemed like a lot of hassle - there are just too many components.
For option 2, Rook is the most popular solution. Rook is a Kubernetes Operator - essentially a set of tools that make it easier to deploy and manage Ceph (and other storage systems) on top of Kubernetes. That's what I decided to use.
Creating Rook manifest
git clone https://github.com/rook/rook
cd rook
git checkout v1.3.8
cd cluster/examples/kubernetes/ceph
Create common resources:
kubectl create -f common.yaml
Make sure we have the resources:
$ kubectl explain cephclusters
KIND:     CephCluster
VERSION:  ceph.rook.io/v1

DESCRIPTION:
     <empty>
Create operator:
kubectl create -f operator.yaml
Check its status:
$ kubectl get pod -n rook-ceph
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6cc9c67b48-m875j   1/1     Running   0          15s
Defining and creating a cluster
Modern versions of Ceph use Bluestore as the storage engine. Bluestore works on top of raw block devices, so for this installation I'd leave the small SSD for the system, use a partition of the HDD, and give the entire USB-HDD to Ceph.
LVM2 must still be installed:
sudo dnf install lvm2
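Rook will only consume devices and partitions that are completely empty, so the target devices need to be checked and, if necessary, wiped beforehand. A rough sketch (the device names here are examples - double-check them against your own machine before wiping anything):

# Show devices, partitions and any leftover filesystem signatures
lsblk -f
# Remove old filesystem/partition signatures from the devices Ceph will own
sudo wipefs -a /dev/sda2
sudo wipefs -a /dev/sdc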
Next, it was necessary to deploy the manifests that create a basic Ceph cluster. These run the components of Ceph: the manager, the monitor, the OSDs and several others. OSDs are probably the most interesting component - they do the actual interaction with storage devices, and there will be one per device.
Since I used a non-standard Ceph setup with just two devices, I had to tweak the sample manifests quite a bit. Here's cluster.yaml:
#################################################################################################################
# Define the settings for the rook-ceph cluster with common settings for a production cluster.
# All nodes with available raw devices will be used for the Ceph cluster. At least three nodes are required
# in this example. See the documentation for more details on storage settings available.
# For example, to create the cluster:
#   kubectl create -f common.yaml
#   kubectl create -f operator.yaml
#   kubectl create -f cluster.yaml
#################################################################################################################
kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    osd_pool_default_size = 1
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v14.2.10
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  mon:
    # This is one of the tweaks - normally you'd want more than one monitor and
    # you'd want to spread them out
    count: 1
    allowMultiplePerNode: true
  mgr:
    modules:
    - name: pg_autoscaler
      enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    # requires Prometheus to be pre-installed
    enabled: false
    rulesNamespace: rook-ceph
  network:
    # Enable host networking. This is useful if we want to mount ceph outside of
    # Kubernetes virtual network.
    provider: host
  rbdMirroring:
    workers: 0
  crashCollector:
    disable: false
  cleanupPolicy:
    confirmation: ""
  annotations:
  resources:
  removeOSDsIfOutAndSafeToRemove: false
  # This is an interesting section - it determines which devices will be
  # considered to be part of bluestore. Devices must be empty. Use wipefs to
  # clean them up.
  storage:
    useAllNodes: true
    useAllDevices: false
    devices:
    # HDD:
    - name: "/dev/disk/by-id/ata-Hitachi_HTS545050A7E380_TE85113Q0J4S5R-part2"
    # USB-HDD:
    # by-id does not work due to a bug
    #- name: "/dev/disk/by-id/usb-WD_Elements_1042_57584C314139313237333936-0:0"
    - name: "sdc"
    config:
      osdsPerDevice: "1"
      storeType: bluestore
  disruptionManagement:
    managePodBudgets: false
    osdMaintenanceTimeout: 30
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
Another useful thing to have is the toolbox - it lets me maintain Ceph without having to install Ceph tools on my hosts. Here's toolbox.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: rook-ceph-tools
  namespace: rook-ceph
  labels:
    app: rook-ceph-tools
spec:
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: rook-ceph-tools
    image: rook/ceph:v1.3.8
    command: ["/tini"]
    args: ["-g", "--", "/usr/local/bin/toolbox.sh"]
    imagePullPolicy: IfNotPresent
    env:
    - name: ROOK_ADMIN_SECRET
      valueFrom:
        secretKeyRef:
          name: rook-ceph-mon
          key: admin-secret
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ceph
      name: ceph-config
    - name: mon-endpoint-volume
      mountPath: /etc/rook
  hostNetwork: true
  volumes:
  - name: mon-endpoint-volume
    configMap:
      name: rook-ceph-mon-endpoints
      items:
      - key: data
        path: mon-endpoints
  - name: ceph-config
    emptyDir: {}
Let's apply these two and see them in action:
kubectl apply -f cluster.yaml
# Let us also deploy a toolbox - will help us monitor cluster status.
kubectl apply -f toolbox.yaml
Check pods:
$ kubectl get pod -n rook-ceph
NAME                                                              READY   STATUS        RESTARTS   AGE
csi-cephfsplugin-provisioner-7469b99d4b-6wwdk                     5/5     Running       1          3d16h
csi-cephfsplugin-provisioner-7469b99d4b-mj2zc                     5/5     Running       2          3d16h
csi-cephfsplugin-rl2zf                                            3/3     Running       0          3d16h
csi-rbdplugin-hw8fh                                               3/3     Running       0          3d16h
csi-rbdplugin-provisioner-865f4d8d-dp5d9                          6/6     Running       4          3d16h
csi-rbdplugin-provisioner-865f4d8d-r2wlf                          6/6     Running       0          3d16h
rook-ceph-crashcollector-krusty.home.greenmice.info-75fdd66f7q2   1/1     Running       0          137m
rook-ceph-mgr-a-64fd77c8fd-fhc4n                                  1/1     Running       4          4d14h
rook-ceph-mon-a-cb5b84f5c-wjqjb                                   1/1     Running       5          4d14h
rook-ceph-operator-6cc9c67b48-ltvxm                               1/1     Running       1          3d16h
rook-ceph-operator-6cc9c67b48-m875j                               0/1     Terminating   8          4d14h
rook-ceph-osd-0-7cd6975767-swtgr                                  1/1     Running       0          137m
rook-ceph-osd-1-7b87f564fc-mcbd5                                  1/1     Running       0          137m
rook-ceph-osd-prepare-krusty.home.greenmice.info-zsjd2            0/1     Completed     0          129m
rook-ceph-tools                                                   1/1     Running       0          100s
rook-discover-htxkp                                               1/1     Running       0          3d16h

$ kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph status
  cluster:
    id:     6ab148e7-fd6e-4132-bdaf-e5ce7934d2cb
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 17h)
    mgr: a(active, since 2h)
    osd: 2 osds: 2 up (since 2h), 2 in (since 2h)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   2.0 GiB used, 1.3 TiB / 1.3 TiB avail
    pgs:
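Besides ceph status, a couple of toolbox commands are useful for confirming that exactly one OSD came up per device and for watching per-OSD usage (output omitted here):

# CRUSH tree: both OSDs should appear under the single host
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph osd tree
# Per-OSD capacity and utilisation
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph osd df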
Using Ceph - test filesystem
The above does not create any Ceph pools - we have the Ceph components running, but no storage is actually usable yet. I created a test filesystem and mounted it on my workstation, outside of the Kubernetes cluster.
To do this I slightly modified filesystem-test.yaml from the examples directory: I changed the name and flipped activeStandby to false. Note that this pool has a replica size of 1, so it does not provide any redundancy.
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: test-fs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 1
      requireSafeReplicaSize: false
  dataPools:
  - failureDomain: osd
    replicated:
      size: 1
      requireSafeReplicaSize: false
    compressionMode: none
  preservePoolsOnDelete: false
  metadataServer:
    activeCount: 1
    activeStandby: false
Applied via:
kubectl apply -f filesystem-test.yaml
After this I could see my test-fs in the ceph status output:
$ kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph status
  cluster:
    id:     6ab148e7-fd6e-4132-bdaf-e5ce7934d2cb
    health: HEALTH_WARN
            2 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum a (age 20h)
    mgr: a(active, since 4h)
    mds: test-fs:1 {0=test-fs-b=up:active} 1 up:standby-replay
    osd: 2 osds: 2 up (since 5h), 2 in (since 5h)

  task status:
    scrub status:
        mds.test-fs-a: idle
        mds.test-fs-b: idle

  data:
    pools:   2 pools, 64 pgs
    objects: 22 objects, 2.2 KiB
    usage:   2.0 GiB used, 1.3 TiB / 1.3 TiB avail
    pgs:     64 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Now, in order to mount it I needed to know the address of the monitor service and a secret. There are probably better ways to do this, but for a test they can be obtained via these two commands:
$ kubectl -n rook-ceph exec -it rook-ceph-tools -- grep mon_host /etc/ceph/ceph.conf
mon_host = 192.168.0.54:6789

$ kubectl -n rook-ceph exec -it rook-ceph-tools -- grep key /etc/ceph/keyring
key = A<xxx>g==
Or, to save them in variables:
mon_host=$(kubectl -n rook-ceph exec -it rook-ceph-tools -- grep mon_host /etc/ceph/ceph.conf | cut -d " " -f 3 | tr -d '\r')
ceph_secret=$(kubectl -n rook-ceph exec -it rook-ceph-tools -- grep key /etc/ceph/keyring | cut -d " " -f 3 | tr -d '\r')
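Alternatively, the same values can be read directly from the objects the operator creates (the names below are the ones the toolbox manifest above references):

# Monitor endpoints from the ConfigMap maintained by the operator
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.data}'
# Admin key from the rook-ceph-mon Secret (values in the API are base64-encoded)
kubectl -n rook-ceph get secret rook-ceph-mon -o jsonpath='{.data.admin-secret}' | base64 -d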
In order to mount it, the kernel needs to be compiled with Ceph support and the Ceph tools must be installed. My client machine runs Gentoo, where the Ceph tools can be installed with the following command:
sudo emerge -av sys-cluster/ceph
And mount:
sudo mkdir -p /mnt/ceph-test
sudo mount -t ceph -o mds_namespace=test-fs,name=admin,secret=$ceph_secret $mon_host:/ /mnt/ceph-test
# By default permissions set to be only writable by root
sudo touch /mnt/ceph-test/test
sudo rm /mnt/ceph-test/test
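To make the mount survive reboots, an /etc/fstab entry along these lines should work; the secret goes into a root-only file instead of the mount command line (the monitor address and paths are the ones used above, adjust as needed):

# /etc/ceph/admin.secret contains only the key, readable by root alone
192.168.0.54:6789:/  /mnt/ceph-test  ceph  name=admin,secretfile=/etc/ceph/admin.secret,mds_namespace=test-fs,_netdev,noatime  0  0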
Using Ceph - "prod" filesystem
A filesystem with a replication factor of 1 would obviously have terrible durability. Ceph recommends a replication factor of 3, or Reed-Solomon erasure coding.
I used a replication factor of 2 instead. It still has poor durability: not only can two device failures result in data loss, but if there is some form of corruption (e.g. due to bitrot or a sudden crash), Ceph may not be able to determine which of the two replicas is correct. Still, it's good enough for my purposes.
I removed the test filesystem created earlier and created a proper one:
# https://github.com/rook/rook/blob/master/Documentation/ceph-filesystem.md
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: replicated2
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2
      requireSafeReplicaSize: true
  dataPools:
  # 'failureDomain: osd' protects from a single osd crash or single device
  # failure but not from the whole node failure.
  - failureDomain: osd
    replicated:
      size: 2
      requireSafeReplicaSize: true
    compressionMode: aggressive
  preservePoolsOnDelete: false
  metadataServer:
    activeCount: 1
    activeStandby: false
Another thing that is required is a StorageClass. Without it, the filesystem would be created, but it wouldn't be possible to reference it via a PersistentVolumeClaim.
# https://github.com/rook/rook/blob/master/Documentation/ceph-filesystem.md
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.cephfs.csi.ceph.com
allowVolumeExpansion: true
parameters:
  # clusterID is the namespace where operator is deployed.
  clusterID: rook-ceph

  # CephFS filesystem name into which the volume shall be created
  fsName: replicated2

  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: replicated2-data0

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
And that was enough to be able to use CephFS as persistent volumes for my Kubernetes Pods, e.g. for Jellyfin:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-config
  namespace: media
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media
  namespace: media
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 100Gi
---
# Jellyfin deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jellyfin-deployment
  labels:
    app: jellyfin
  namespace: media
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: jellyfin
  template:
    metadata:
      labels:
        app: jellyfin
    spec:
      containers:
      - name: jellyfin
        image: linuxserver/jellyfin:version-10.7.5-1
        ports:
        - containerPort: 8096
        env:
        - name: TZ
          value: "Europe/Dublin"
        - name: UMASK
          value: "000"
        - name: PUID
          value: "1000"
        - name: PGID
          value: "1000"
        volumeMounts:
        - name: media
          mountPath: /data
        - name: jellyfin-config
          mountPath: /config
        resources:
          limits:
            cpu: "4"
            memory: "1Gi"
          requests:
            cpu: "10m"
            memory: "512Mi"
      securityContext:
        fsGroup: 1000
        fsGroupChangePolicy: "OnRootMismatch"
      volumes:
      - name: media
        persistentVolumeClaim:
          claimName: media
      - name: jellyfin-config
        persistentVolumeClaim:
          claimName: jellyfin-config
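Once this is applied, it's worth checking that the claims actually bound to dynamically provisioned CephFS volumes (output omitted here):

kubectl -n media get pvc
kubectl -n media get pods
# Inspect the PersistentVolume created for the media claim
kubectl describe pv "$(kubectl -n media get pvc media -o jsonpath='{.spec.volumeName}')"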
Conclusion
Open-source cluster filesystems are here and available to hobbyists. Learning about them and setting everything up took many evenings (and I did not mention the many mistakes I made along the way), but now I have my own cluster filesystem.
I have not used it for much yet beyond the Jellyfin deployment above; per the documentation, any other workload can use it the same way, by referencing the CephFilesystem through a Persistent Volume Claim. I may write more about it in the future.
Future work
- Add more nodes, and increase number of monitors to 3.
- Upgrade to Ceph 15.