So, we have a fully-functional Ceph object store. Yay! Now let's add all the frosting needed to turn it into a distributed file system. What follows is based on the tail-end of the Storage Cluster Quick Start and the Filesystem Quick Start.
As a reminder, here's where the cluster stands right now:
[ceph-deploy@admin ~]$ sudo ceph status
cluster:
id: 7832fc74-3164-454c-9789-e3e2fcc8940f
health: HEALTH_WARN
clock skew detected on mon.node2, mon.node3
services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node2, node3
osd: 3 osds: 3 up, 3 in
data:
pools: 2 pools, 60 pgs
objects: 42 objects, 2.6 MiB
usage: 546 MiB used, 2.5 GiB / 3.0 GiB avail
pgs: 60 active+clean
Note the carping about "clock skew"; that seems to happen intermittently even though NTP is installed. I suspect it's a by-product of running in a virtual environment and suspending/resuming nodes and such.
Anyhow, we've got monitor daemons, manager daemons, OSD daemons... all the bits needed to use the object storage features of Ceph. In order to use the DFS features we have to set up the metadata server daemons. So let's install three of them, one per node, which will allow us to have a load-sharing config with a backup:
ceph-deploy mds create node1 node2 node3
The cluster now reports 3 MDS daemons, 1 active and 2 standby:
[ceph-deploy@admin ~]$ sudo ceph status
cluster:
id: 8a7aac93-aa5a-4328-8684-8392a2e9ce8f
health: HEALTH_WARN
clock skew detected on mon.node2, mon.node3
services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node2, node3
mds: test_fs-1/1/1 up {0=node3=up:active}, 2 up:standby
osd: 3 osds: 3 up, 3 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 41 MiB used, 2.9 GiB / 3.0 GiB avail
pgs:
Remember that we wanted two active and one standby; we'll get to that in a moment.
Let's create a filesystem:
sudo ceph osd pool create cephfs_data 30
sudo ceph osd pool create cephfs_metadata 30
sudo ceph fs new test_fs cephfs_metadata cephfs_data
So what's going on here? A Ceph filesystem makes use of two pools,
one for data and one for metadata. The above creates two pools ("cephfs_data" and "cephfs_metadata") and then builds a filesystem named "test_fs" on top of them. All of that is pretty obvious, but what's up with the "30" that's passed as an argument during pool creation?
Placement groups "are an internal implementation detail of how Ceph distributes data"; the Ceph documentation provides a lot of guidance on calculating an appropriate placement group count. "30" is the smallest value that keeps the cluster from complaining in my test setup; you'd almost certainly do something different in production.
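For a rough sense of where a "real" number would come from, here's the commonly cited rule of thumb from the Ceph docs — (total OSDs × 100) / replica count, rounded up to the next power of two — as a quick shell sketch (the variable names are mine, and the values match this toy cluster):

```shell
# Rule-of-thumb PG count: (total OSDs * 100) / pool replica size,
# rounded up to the next power of two.
osds=3
replicas=3
target=$(( osds * 100 / replicas ))
pgs=1
while [ "$pgs" -lt "$target" ]; do
    pgs=$(( pgs * 2 ))
done
echo "$pgs"    # prints 128
```

That heuristic suggests 128 PGs per pool, far above the 30 used here; on a cluster with only 3 GiB of raw space, though, the "smallest value that stops the warnings" approach is as good as any.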
Now, there's one little tweak to make before we can put the FS into service. The number of active metadata servers is controlled on an FS-by-FS basis; in order to go from 1 active/2 standby to 2 active/1 standby we need to update the max_mds attribute of the filesystem:
sudo ceph fs set test_fs max_mds 2
This tells Ceph that test_fs can have up to two metadata servers active at any given time. And lo, so reporteth the cluster:
[ceph-deploy@admin ~]$ sudo ceph status
cluster:
id: 8a7aac93-aa5a-4328-8684-8392a2e9ce8f
health: HEALTH_WARN
clock skew detected on mon.node2, mon.node3
services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node2, node3
mds: test_fs-2/2/2 up {0=node3=up:active,1=node2=up:active}, 1 up:standby
osd: 3 osds: 3 up, 3 in
data:
pools: 2 pools, 60 pgs
objects: 40 objects, 3.5 KiB
usage: 46 MiB used, 2.9 GiB / 3.0 GiB avail
pgs: 60 active+clean
io:
client: 938 B/s wr, 0 op/s rd, 3 op/s wr
Two MDS daemons are active and one is on standby, just like we wanted.
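If you'd rather confirm the setting directly than eyeball the status output, `ceph fs get` dumps a filesystem's parameters, max_mds among them. This needs to run somewhere with the admin keyring, same as the commands above:

```shell
# Dump the filesystem map for test_fs and pick out max_mds,
# which should now read 2.
sudo ceph fs get test_fs | grep max_mds
```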
Alright then, let's mount this sucker up. First, we need to create a plain text file that contains the admin user's key. Get the key from the admin node:
[ceph-deploy@admin ~]$ cat ceph.client.admin.keyring
[client.admin]
key = AQBttrdc5s8ZFRAApYevw6fxChKNLX2ViwQcHQ==
caps mds = "allow *"
caps mgr = "allow *"
caps mon = "allow *"
caps osd = "allow *"
and then copy it over to the client:
[root@client ~]# cat > admin.secret
AQBttrdc5s8ZFRAApYevw6fxChKNLX2ViwQcHQ==
^D
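Hand-copying keys invites typos; as an alternative, `ceph auth get-key` prints just the bare key, so (assuming SSH access from the admin node to the client — the hostname "client" is from this post's setup) you could generate and ship the secret file in one go:

```shell
# On the admin node: extract the bare key and copy it to the client.
sudo ceph auth get-key client.admin > admin.secret
scp admin.secret root@client:/root/admin.secret
```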
This file will be used to auth against the Ceph cluster during the mount process. Next, make a mount point:
[root@client ~]# mkdir /mnt/mycephfs
and mount the filesystem:
[root@client ~]# mount -t ceph node1,node2,node3:/ /mnt/mycephfs -o name=admin,secretfile=admin.secret
The above makes the client aware of all three monitor daemons (yay HA) and mounts the filesystem as the admin user.
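That mount won't survive a reboot. To make it persistent, the CephFS docs describe an /etc/fstab entry along these lines (the secret-file path here assumes it lives in /root; _netdev delays the mount until networking is up):

```shell
# /etc/fstab entry for the kernel CephFS client (sketch; adjust paths)
node1,node2,node3:/  /mnt/mycephfs  ceph  name=admin,secretfile=/root/admin.secret,noatime,_netdev  0  2
```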
Et voilà:
[root@client ~]# mount | grep ceph
10.0.0.3,10.0.0.4,10.0.0.5:/ on /mnt/mycephfs type ceph (rw,relatime,name=admin,secret=,acl,wsize=16777216)
[root@client ~]# ls -la /mnt/mycephfs
total 0
drwxr-xr-x 1 root root 0 Apr 18 00:54 .
drwxr-xr-x. 4 root root 38 Apr 18 18:30 ..
So, can it dance?
[root@client mycephfs]# cd /mnt/mycephfs/
[root@client mycephfs]# yum install sysbench
...
[root@client mycephfs]# sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw prepare
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.17 (using system LuaJIT 2.0.4)
128 files, 4000Kb each, 500Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
...
Creating file test_file.127
524288000 bytes written in 13.77 seconds (36.32 MiB/sec).
So far, so good, right? Looks like it's a filesystem. And, after mounting the filesystem, we can see the files from the admin node:
[ceph-deploy@admin ~]$ ls -l /mnt/mycephfs/
total 512000
-rw------- 1 root root 4096000 Apr 18 2019 test_file.0
...
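One housekeeping note: sysbench's fileio test has a cleanup stage that removes the test_file.* litter, using the same (deprecated but functional) option style as the prepare stage above:

```shell
# Remove the 128 test files created by the prepare stage.
cd /mnt/mycephfs
sysbench --test=fileio --file-total-size=500M cleanup
```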
Now, for comparison, here's the prepare stage using just a single /dev/sdb device:
[root@node1 ~]# mkdir /mnt/test
[root@node1 ~]# mount /dev/sdb /mnt/test
[root@node1 ~]# cd /mnt/test
[root@node1 test]# sysbench --test=fileio --file-total-size=500M --file-test-mode=rndrw prepare
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.17 (using system LuaJIT 2.0.4)
128 files, 4000Kb each, 500Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
...
Creating file test_file.127
524288000 bytes written in 1.85 seconds (269.93 MiB/sec).
Which tells us... not a whole lot, except maybe don't expect any sort of performance if you build a Ceph cluster on your laptop. It does demonstrate that there's non-trivial overhead, but that's exactly what we'd expect: all of the bookkeeping for a DFS is way more complicated than writing to a single XFS filesystem.
And that pretty much concludes the messing around. As noted over and over (and over...), nothing I do on this here blog should be treated as production-worthy unless I explicitly say so; this is doubly true in the case of these Ceph experiments. There's a tremendous amount of thought and tuning that goes into putting together a DFS suitable for production; Ceph has a lot of knobs to turn, can be made very smart about data placement and failure domains, and so on. That said, I am quite pleased with the experience I've documented here; the ceph-deploy tool in particular makes things very easy. Were I to need a DFS in production, I expect Ceph would easily top the shortlist.