So, we have a fully-functional Ceph object store. Yay! Now let's add all the frosting needed to turn it into a distributed file system. What follows is based on the tail-end of the Storage Cluster Quick Start and the Filesystem Quick Start.
As a reminder, here's where the cluster stands right now:
[ceph-deploy@admin ~]$ sudo ceph status
cluster:
id: 7832fc74-3164-454c-9789-e3e2fcc8940f
health: HEALTH_WARN
clock skew detected on mon.node2, mon.node3
services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node2, node3
osd: 3 osds: 3 up, 3 in
data:
pools: 2 pools, 60 pgs
objects: 42 objects, 2.6 MiB
usage: 546 MiB used, 2.5 GiB / 3.0 GiB avail
pgs: 60 active+clean
Note the carping about "clock skew"; that seems to happen intermittently even though NTP is installed. I suspect it's a by-product of running in a virtual environment and suspending/resuming nodes and such.
Anyhow, we've got monitor daemons, manager daemons, OSD daemons... all the bits needed to use the object storage features of Ceph. In order to use the DFS features we have to set up the metadata server daemons. So let's install three of them, one per node, which will allow us to have a load-sharing config with a backup:
ceph-deploy mds create node1 node2 node3
The cluster now reports 3 MDS daemons, 1 active and 2 standby:
[ceph-deploy@admin ~]$ sudo ceph status
cluster:
id: 8a7aac93-aa5a-4328-8684-8392a2e9ce8f
health: HEALTH_WARN
clock skew detected on mon.node2, mon.node3
services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node2, node3
mds: test_fs-1/1/1 up {0=node3=up:active}, 2 up:standby
osd: 3 osds: 3 up, 3 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 41 MiB used, 2.9 GiB / 3.0 GiB avail
pgs:
Remember that we wanted two active and one standby; we'll get to that in a moment.
Let's create a filesystem:
sudo ceph osd pool create cephfs_data 30
sudo ceph osd pool create cephfs_metadata 30
sudo ceph fs new test_fs cephfs_metadata cephfs_data
So what's going on here? A Ceph filesystem makes use of two pools,
one for data and one for metadata. The above creates two pools ("cephfs_data" and "cephfs_metadata") and then builds a filesystem named "test_fs" on top of them. All of that is pretty obvious, but what's up with the "30" that's passed as an argument during pool creation?
Placement groups "are an internal implementation detail of how Ceph distributes data"; the Ceph documentation provides a lot of guidance on calculating an appropriate placement group count. "30" is the smallest value that keeps the cluster from complaining in my test setup; you'd almost certainly do something different in production.
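For a rough sense of where a "real" number would come from, here's the commonly cited rule of thumb from the Ceph docs — (total OSDs × 100) / replica count, rounded up to the next power of two — as a quick shell sketch (the variable names are mine, and the values match this toy cluster):

```shell
# Rule-of-thumb PG count: (total OSDs * 100) / pool replica size,
# rounded up to the next power of two.
osds=3
replicas=3
target=$(( osds * 100 / replicas ))
pgs=1
while [ "$pgs" -lt "$target" ]; do
    pgs=$(( pgs * 2 ))
done
echo "$pgs"    # prints 128
```

That heuristic suggests 128 PGs per pool, far above the 30 used here; on a cluster with only 3 GiB of raw space, though, the "smallest value that stops the warnings" approach is as good as any.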
Now, there's one little tweak to make before we can put the FS into service. The number of active metadata servers is controlled on an FS-by-FS basis; in order to go from 1 active/2 standby to 2 active/1 standby we need to update the max_mds attribute of the filesystem:
sudo ceph fs set test_fs max_mds 2
This tells Ceph that test_fs can have up to two metadata servers active at any given time. And lo, so reporteth the cluster:
[ceph-deploy@admin ~]$ sudo ceph status
cluster:
id: 8a7aac93-aa5a-4328-8684-8392a2e9ce8f
health: HEALTH_WARN
clock skew detected on mon.node2, mon.node3
services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node2, node3
mds: test_fs-2/2/2 up {0=node3=up:active,1=node2=up:active}, 1 up:standby
osd: 3 osds: 3 up, 3 in
data:
pools: 2 pools, 60 pgs
objects: 40 objects, 3.5 KiB
usage: 46 MiB used, 2.9 GiB / 3.0 GiB avail
pgs: 60 active+clean
io:
client: 938 B/s wr, 0 op/s rd, 3 op/s wr
Two MDS daemons are active and one is on standby, just like we wanted.
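If you'd rather confirm the setting directly than eyeball the status output, `ceph fs get` dumps a filesystem's parameters, max_mds among them. This needs to run somewhere with the admin keyring, same as the commands above:

```shell
# Dump the filesystem map for test_fs and pick out max_mds,
# which should now read 2.
sudo ceph fs get test_fs | grep max_mds
```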
Alright then, let's mount this sucker up. First, we need to create a plain text file that contains the admin user's key. Get the key from the admin node:
[ceph-deploy@admin ~]$ cat ceph.client.admin.keyring
[client.admin]
key = AQBttrdc5s8ZFRAApYevw6fxChKNLX2ViwQcHQ==
caps mds = "allow *"
caps mgr = "allow *"
caps mon = "allow *"
caps osd = "allow *"
and then copy it over to the client:
[root@client ~]# cat > admin.secret
AQBttrdc5s8ZFRAApYevw6fxChKNLX2ViwQcHQ==
^D
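Hand-copying keys invites typos; as an alternative, `ceph auth get-key` prints just the bare key, so (assuming SSH access from the admin node to the client — the hostname "client" is from this post's setup) you could generate and ship the secret file in one go:

```shell
# On the admin node: extract the bare key and copy it to the client.
sudo ceph auth get-key client.admin > admin.secret
scp admin.secret root@client:/root/admin.secret
```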
This file will be used to auth against the Ceph cluster during the mount process. Next, make a mount point:
[root@client ~]# mkdir /mnt/mycephfs
and mount the filesystem:
[root@client ~]# mount -t ceph node1,node2,node3:/ /mnt/mycephfs -o name=admin,secretfile=admin.secret
The above makes the client aware of all three monitor daemons (yay HA) and mounts the filesystem as the admin user.
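That mount won't survive a reboot. To make it persistent, the CephFS docs describe an /etc/fstab entry along these lines (the secret-file path here assumes it lives in /root; _netdev delays the mount until networking is up):

```shell
# /etc/fstab entry for the kernel CephFS client (sketch; adjust paths)
node1,node2,node3:/  /mnt/mycephfs  ceph  name=admin,secretfile=/root/admin.secret,noatime,_netdev  0  2
```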
Et voilà:
[root@client ~]# mount | grep ceph
10.0.0.3,10.0.0.4,10.0.0.5:/ on /mnt/mycephfs type ceph (rw,relatime,name=admin,secret=,acl,wsize=16777216)
[root@client ~]# ls -la /mnt/mycephfs
total 0
drwxr-xr-x 1 root root 0 Apr 18 00:54 .
drwxr-xr-x. 4 root root 38 Apr 18 18:30 ..
So, can it dance?
[root@client mycephfs]# cd /mnt/mycephfs/
[root@client mycephfs]# yum install sysbench
...
[root@client mycephfs]# sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw prepare
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.17 (using system LuaJIT 2.0.4)
128 files, 4000Kb each, 500Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
...
Creating file test_file.127
524288000 bytes written in 13.77 seconds (36.32 MiB/sec).
So far, so good, right? Looks like it's a filesystem. And, after mounting the filesystem, we can see the files from the admin node:
[ceph-deploy@admin ~]$ ls -l /mnt/mycephfs/
total 512000
-rw------- 1 root root 4096000 Apr 18 2019 test_file.0
...
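One housekeeping note: sysbench's fileio test has a cleanup stage that removes the test_file.* litter, using the same (deprecated but functional) option style as the prepare stage above:

```shell
# Remove the 128 test files created by the prepare stage.
cd /mnt/mycephfs
sysbench --test=fileio --file-total-size=500M cleanup
```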
Now, for comparison, here's the prepare stage using just a single /dev/sdb device:
[root@node1 ~]# mkdir /mnt/test
[root@node1 ~]# mount /dev/sdb /mnt/test
[root@node1 ~]# cd /mnt/test
[root@node1 test]# sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw prepare
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.17 (using system LuaJIT 2.0.4)
128 files, 4000Kb each, 500Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
...
Creating file test_file.127
524288000 bytes written in 1.85 seconds (269.93 MiB/sec).
Which tells us... not a whole lot, except maybe don't expect any sort of performance if you build a Ceph cluster on your laptop. It does demonstrate that there's non-trivial overhead, but that's exactly what we'd expect: all of the bookkeeping for a DFS is way more complicated than writing to a single XFS filesystem.
And that pretty much concludes the messing around. As noted over and over (and over...), nothing I do on this here blog should be treated as production-worthy unless I explicitly say so; this is doubly true in the case of these Ceph experiments. There's a tremendous amount of thought and tuning that goes into putting together a DFS suitable for production; Ceph has a lot of knobs to turn, can be made very smart about data placement and failure domains, and so on. That said, I am quite pleased with the experience I've documented here; the ceph-deploy tool in particular makes things very easy. Were I to need a DFS in production, I expect Ceph would easily top the shortlist.