Messing Around With Ceph (3/N)
So, we have a fully functional Ceph object store. Yay! Now let's add all the frosting needed to turn it into a distributed file system. What follows is based on the tail end of the Storage Cluster Quick Start and the Filesystem Quick Start.
As a reminder, here's where the cluster stands right now:
[ceph-deploy@admin ~]$ sudo ceph status
  cluster:
    id:     7832fc74-3164-454c-9789-e3e2fcc8940f
    health: HEALTH_WARN
            clock skew detected on mon.node2, mon.node3

  services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: node1(active), standbys: node2, node3
    mds: test_fs-2/2/2 up {0=node3=up:active,1=node2=up:active}, 1 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 60 pgs
    objects: 42 objects, 2.6 MiB
    usage:   546 MiB used, 2.5 GiB / 3.0 GiB avail
    pgs:     60 active+clean

Note the carping about "clock skew"; that seems to happen intermittently even though NTP is installed. I suspect it's a by-product of running in a virtual environment and suspending/resuming nodes and such.
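If you want to chase that warning down rather than shrug at it, two read-only checks are handy: Ceph's own view of how far apart the monitor clocks are, and the NTP daemon's view of its peers. (A hedged aside: this assumes the nodes run ntpd as set up earlier; chrony users would run chronyc sources instead.)

sudo ceph time-sync-status    # run from the admin node: the monitors' view of clock drift
ntpq -p                       # run on each mon node: is ntpd actually syncing against its peers?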
Anyhow, we've got monitor daemons, manager daemons, OSD daemons... all the bits needed to use the object storage features of Ceph. In order to use the DFS features we have to set up metadata server (MDS) daemons. So let's install three of them, one per node, which will allow us to have a load-sharing config with a backup:
ceph-deploy mds create node1 node2 node3

The cluster now reports 3 MDS nodes, 1 active and 2 standby:
[ceph-deploy@admin ~]$ sudo ceph status
  cluster:
    id:     8a7aac93-aa5a-4328-8684-8392a2e9ce8f
    health: HEALTH_WARN
            clock skew detected on mon.node2, mon.node3

  services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: node1(active), standbys: node2, node3
    mds: test_fs-1/1/1 up {0=node3=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   41 MiB used, 2.9 GiB / 3.0 GiB avail
    pgs:

Remember that we wanted to have two nodes active and 1 standby; we'll get to that in a moment.
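(A small convenience, not a required step: if you only care about the MDS picture, ceph mds stat prints the same mds: line as a one-liner, which is handy for watching daemons flip between standby and active.)

sudo ceph mds stat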
Let's create a filesystem:
sudo ceph osd pool create cephfs_data 30
sudo ceph osd pool create cephfs_metadata 30
sudo ceph fs new test_fs cephfs_metadata cephfs_data

So what's going on here? A Ceph filesystem makes use of two pools, one for data and one for metadata. The above creates two pools ("cephfs_data" and "cephfs_metadata") and then builds a filesystem named "test_fs" on top of them. All of that is pretty obvious, but what's up with the "30" that's passed as an argument during pool creation?
Placement groups "are an internal implementation detail of how Ceph distributes data"; the Ceph documentation provides a lot of guidance on how to calculate appropriate placement group sizes. '30' is the smallest value that prevents the cluster from complaining in my test setup; you'd almost certainly do something different in production.
Now, there's one little tweak to make before we can put the FS into service. The number of active metadata servers is controlled on an FS-by-FS basis; in order to go from 1 active/2 standby to 2 active/1 standby we need to update the max_mds attribute of the filesystem:
sudo ceph fs set test_fs max_mds 2

This tells Ceph that test_fs can have up to two metadata servers active at any given time. And lo, so reporteth the cluster:
[ceph-deploy@admin ~]$ sudo ceph status
  cluster:
    id:     8a7aac93-aa5a-4328-8684-8392a2e9ce8f
    health: HEALTH_WARN
            clock skew detected on mon.node2, mon.node3

  services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: node1(active), standbys: node2, node3
    mds: test_fs-2/2/2 up {0=node3=up:active,1=node2=up:active}, 1 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 60 pgs
    objects: 40 objects, 3.5 KiB
    usage:   46 MiB used, 2.9 GiB / 3.0 GiB avail
    pgs:     60 active+clean

  io:
    client:   938 B/s wr, 0 op/s rd, 3 op/s wr

Two MDS nodes are active and one is in standby, just like we wanted.
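If you'd rather query the filesystem directly than eyeball the status output, these read-only commands show what exists and how it's configured (the grep just pulls the one line of interest out of the full filesystem map):

sudo ceph fs ls                          # list filesystems and the pools backing them
sudo ceph fs get test_fs | grep max_mds  # dump the fs map and show the max_mds setting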
Alright then, let's mount this sucker up. First, we need to create a plain text file that contains the admin user's key. Get the key from the admin node:
[ceph-deploy@admin ~]$ cat ceph.client.admin.keyring
[client.admin]
        key = AQBttrdc5s8ZFRAApYevw6fxChKNLX2ViwQcHQ==
        caps mds = "allow *"
        caps mgr = "allow *"
        caps mon = "allow *"
        caps osd = "allow *"

and then copy it over to the client:
[root@client ~]# cat > admin.secret
AQBttrdc5s8ZFRAApYevw6fxChKNLX2ViwQcHQ==
^D

This file will be used to auth against the Ceph cluster during the mount process.
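(An alternative, if hand-typing base64 keys offends: ceph auth get-key prints just the key, so you could generate the secret file on the admin node and copy it over. The scp below assumes the admin node can SSH to the client as root, which is an assumption about my lab setup, not a requirement.)

[ceph-deploy@admin ~]$ sudo ceph auth get-key client.admin > admin.secret
[ceph-deploy@admin ~]$ scp admin.secret root@client:/root/admin.secret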
Next, make a mount point:

[root@client ~]# mkdir /mnt/mycephfs

and mount the filesystem:
[root@client ~]# mount -t ceph node1,node2,node3:/ /mnt/mycephfs -o name=admin,secretfile=admin.secret

The above makes the client aware of all three monitor daemons (yay HA) and mounts the filesystem as the admin user. Et voila:
[root@client ~]# mount | grep ceph
10.0.0.3,10.0.0.4,10.0.0.5:/ on /mnt/mycephfs type ceph (rw,relatime,name=admin,secret=<hidden>,acl,wsize=16777216)
[root@client ~]# ls -la /mnt/mycephfs
total 0
drwxr-xr-x  1 root root  0 Apr 18 00:54 .
drwxr-xr-x. 4 root root 38 Apr 18 18:30 ..
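To make the mount survive a reboot you'd add a line like the following to /etc/fstab on the client. This is a sketch based on the options used above; it assumes the secret file lives at /root/admin.secret, and you'd adjust names and options for your own environment.

node1,node2,node3:/  /mnt/mycephfs  ceph  name=admin,secretfile=/root/admin.secret,noatime,_netdev  0  0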
So, can it dance?
[root@client mycephfs]# cd /mnt/mycephfs/
[root@client mycephfs]# yum install sysbench
...
[root@client mycephfs]# sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw prepare
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.17 (using system LuaJIT 2.0.4)

128 files, 4000Kb each, 500Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
...
Creating file test_file.127
524288000 bytes written in 13.77 seconds (36.32 MiB/sec).

So far, so good, right? Looks like it's a filesystem. And, after mounting the filesystem, we can see the files from the admin node:
[ceph-deploy@admin ~]$ ls -l /mnt/mycephfs/
total 512000
-rw------- 1 root root 4096000 Apr 18  2019 test_file.0
...
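(The test data is also visible at the pool level; ceph df breaks usage down per pool, which is a quick way to confirm the writes are landing in cephfs_data rather than somewhere surprising. Purely an optional sanity check.)

[ceph-deploy@admin ~]$ sudo ceph df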
Now, for comparison, here's the prepare stage using just a single /dev/sdb device:
[root@node1 ~]# mkdir /mnt/test
[root@node1 ~]# mount /dev/sdb /mnt/test
[root@node1 ~]# cd /mnt/test
[root@node1 test]# sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw prepare
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.17 (using system LuaJIT 2.0.4)

128 files, 4000Kb each, 500Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
...
Creating file test_file.127
524288000 bytes written in 1.85 seconds (269.93 MiB/sec).

Which tells us... not a whole lot, except maybe don't expect any sort of performance if you build a Ceph cluster on your laptop. It does demonstrate that there's non-trivial overhead, but that's exactly what we'd expect. All of the bookkeeping for a DFS is way more complicated than writing to a single XFS filesystem.
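For completeness: the prepare stage only lays the test files down and reports sequential write throughput. If you wanted the mixed random read/write numbers this comparison really begs for, the run and cleanup stages use the same (deprecated) option syntax; I'll leave gathering those numbers as an exercise.

sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw run
sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw cleanup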
And that pretty much concludes the messing around. As noted over and over (and over...), nothing I do on this here blog should be treated as production-worthy unless I explicitly say so; that's doubly true in the case of these Ceph experiments. There's a tremendous amount of thought and tuning that goes into putting together a DFS suitable for production. Ceph has a lot of knobs to turn, can be made very smart about data placement and failure domains, and so on. That said, I am quite pleased with the experience I've documented here; the ceph-deploy tool in particular makes things very easy. Were I to need a DFS in production, I expect Ceph would easily make the top of the shortlist.