Friday, March 29, 2019

Monitoring The Bottom Turtle (3/N)

So... making Prometheus monitor the etcd cluster (and the underlying hardware) turns out to be... pretty trivial. The Prometheus project provides a daemon called "node_exporter" which does the heavy lifting from a machine perspective, while etcd exposes Prometheus metrics out of the box. In order to hook it all together one must:

  1. Create an init script/systemd unit for node_exporter.
  2. Update bootstrap.sh to install node_exporter.
  3. Update prometheus.yml to scrape node_exporter and etcd.

First up, node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
Environment="GOMAXPROCS=1"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/node_exporter

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target

And the additions to bootstrap.sh:

wget -nv https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
cp node_exporter-0.17.0.linux-amd64/node_exporter /usr/local/bin/
mv /vagrant/node_exporter.service /etc/systemd/system
service node_exporter start
rm -rf node_exporter-0.17.0.linux-amd64*
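
Once the cluster has been re-provisioned, a quick way to confirm that all three node_exporters are answering is something like the following sketch. It has to be run from inside one of the VMs, since port 9100 isn't forwarded to the host and the turtle* names only resolve there via the /etc/hosts entries that bootstrap.sh installs:

#!/usr/bin/python3.6

# Rough liveness check for node_exporter on all three nodes. Run this from
# inside one of the VMs; the turtle* names resolve via the /etc/hosts
# entries added by bootstrap.sh, and port 9100 isn't forwarded out.
import urllib.request

for host in ('turtle1', 'turtle2', 'turtle3'):
  try:
    with urllib.request.urlopen('http://%s:9100/metrics' % host, timeout=2) as response:
      response.read()
    print('%s: node_exporter is up' % host)
  except OSError as err:
    print('%s: no response (%s)' % (host, err))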

And, lastly, the revised prometheus.yml:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['turtle1:9090', 'turtle2:9090', 'turtle3:9090']

  - job_name: 'node_exporter'

    static_configs:
      - targets: ['turtle1:9100', 'turtle2:9100', 'turtle3:9100']

  - job_name: 'etcd'

    static_configs:
      - targets: ['turtle1:2379', 'turtle2:2379', 'turtle3:2379']
Two additional jobs have been added, 'node_exporter' and 'etcd'; the former scrapes the node_exporter process (which listens on port 9100 by default) and the latter scrapes etcd on its default client port (2379).

Once all that has been put in place and the Vagrant cluster has been rebuilt, you should see 9 targets across three jobs (prometheus, node_exporter, and etcd) when you navigate to http://localhost:9090/targets.
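
If you'd rather check from a script than a browser, the same information is reachable through Prometheus's HTTP API. Here's a rough sketch (mine, not part of the setup) that uses the /api/v1/query endpoint and the 9090 port forward from the previous post; it should print 9 if everything is being scraped:

#!/usr/bin/python3.6

# Rough sketch: ask Prometheus how many targets are currently up via its
# HTTP query API. Assumes the port-9090 forward from the previous post.
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({'query': 'count(up == 1)'})
with urllib.request.urlopen('http://localhost:9090/api/v1/query?' + query) as response:
  result = json.loads(response.read().decode('utf-8'))

# An instant query returns a vector of samples; each value is [timestamp, "value"].
print(result['data']['result'][0]['value'][1])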

Ok, great, you've got Prometheus gathering all sorts of numbers. Now what?

Prometheus does two main things with this information: alerting and visualization. I trust that Prometheus alerting works; paging someone when things go south is a well-understood problem. The only complication is that there are multiple servers all monitoring the same thing, so in production you'd get multiple emails. I'm not sure how you'd fix that, given that Prometheus servers are independent by design.

As for visualization, I'll let the Prometheus docs speak for themselves: "Console templates are the most powerful way to create templates that can be easily managed in source control. There is a learning curve though, so users new to this style of monitoring should try out Grafana first." Yeah, let's do that. I've been wanting to mess with Grafana for a while anyway.

Monitoring The Bottom Turtle (2/N)

In the previous post we decided to build a self-monitoring cluster using Prometheus and Grafana. This post will focus on getting Prometheus up and running on the same cluster which is used to house etcd. The usual caveats apply: this is just messing around and isn't vetted for production.

Setting up Prometheus is a little involved since there don't appear to be any official RPMs. Briefly:

  1. Create an initial Prometheus config file.
  2. Create a systemd service description (or init script).
  3. Use bootstrap.sh to install Prometheus, move the above files into place, and start the service.
  4. Modify the Vagrant network config so you can get to one of the Prometheus servers.

So, first things first, let's put together a basic configuration file. The file below is a slightly-modified version of the stock configuration file from the Prometheus Getting Started Guide:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['turtle1:9090', 'turtle2:9090', 'turtle3:9090']
The way Prometheus works is to contact targets via HTTP and scrape data from /metrics. The Prometheus server itself is bound to port 9090 on each server, so the fragment above essentially tells it to monitor itself across all 3 cluster nodes. Pretty cool, huh? Note that the servers themselves are totally independent; each server polls itself and its peers independently. As such, each server will have slightly different data samples, but that's a non-issue in the context of monitoring.
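
Just to make the scraping concrete, here's a tiny sketch of my own (not from the Prometheus docs) that pulls /metrics the same way Prometheus does and prints the first few lines of the plain-text exposition format. It assumes either the port forward set up later in this post or being run from inside one of the VMs:

#!/usr/bin/python3.6

# Rough sketch: fetch /metrics over HTTP, just like Prometheus does, and
# print the first few lines of the plain-text exposition format. Assumes
# the port-9090 forward described below (or run it inside one of the VMs).
import urllib.request

with urllib.request.urlopen('http://localhost:9090/metrics') as response:
  for line in response.read().decode('utf-8').splitlines()[:10]:
    print(line)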

Name it prometheus.yml and stick it in the Vagrant project directory; Vagrant shares that directory with the guests as /vagrant, which makes it accessible to bootstrap.sh later on.

Another bit you have to supply is some sort of init script. Here's a systemd unit recommended by Computing For Geeks:

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
Environment="GOMAXPROCS=2"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target
Name this prometheus.service and put it in the Vagrant directory as well.

Update bootstrap.sh to install Prometheus. The script below is mostly stolen from the same Computing For Geeks page:

yum install -y wget
wget -nv https://github.com/prometheus/prometheus/releases/download/v2.8.1/prometheus-2.8.1.linux-amd64.tar.gz
tar zxf prometheus-2.8.1.linux-amd64.tar.gz
groupadd --system prometheus
useradd -s /sbin/nologin --system -g prometheus prometheus
mkdir /var/lib/prometheus
for i in rules rules.d files_sd; do
  mkdir -p -m 775 /etc/prometheus/${i};
  chown -R prometheus:prometheus /etc/prometheus/${i};
done
cp prometheus-2.8.1.linux-amd64/prometheus /usr/local/bin/
cp prometheus-2.8.1.linux-amd64/promtool /usr/local/bin/
cp -r prometheus-2.8.1.linux-amd64/consoles /etc/prometheus
cp -r prometheus-2.8.1.linux-amd64/console_libraries /etc/prometheus
mkdir -p /var/lib/prometheus
chown -R prometheus:prometheus /var/lib/prometheus/
mv /vagrant/prometheus.yml /etc/prometheus
mv /vagrant/prometheus.service /etc/systemd/system
service prometheus start
rm -rf prometheus-2.8.1.linux-amd64*
Note the mv /vagrant/... commands which move the configuration files into place. Once bootstrap.sh completes, the Prometheus server should be up and running on the machine.

At this point we could go ahead and vagrant up the cluster and everything should work, but we wouldn't be able to actually get to any of the servers without some trickery. The simplest solution is just to enable port forwarding for one of the machines. For example:

  config.vm.define "turtle1" do |turtle1|
    turtle1.vm.hostname = "turtle1"
    turtle1.vm.network "private_network", ip: "10.0.0.2",
      virtualbox__intnet: true
    turtle1.vm.network "forwarded_port", guest: 9090, host: 9090
    turtle1.vm.provision :shell, path: "bootstrap.sh", args: "10.0.0.2 turtle1"
  end
This will let you navigate to localhost:9090 and interact with the Prometheus host running on turtle1. You can check to make sure that everything is working by navigating to http://localhost:9090/targets, which should show the three targets all up and healthy.
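
The /targets page also has a JSON twin at /api/v1/targets, so the same check can be scripted. A quick sketch (again assuming the port forward above):

#!/usr/bin/python3.6

# Rough sketch: pull target health from Prometheus's HTTP API instead of
# eyeballing the /targets page. Assumes the port-9090 forward above.
import json
import urllib.request

with urllib.request.urlopen('http://localhost:9090/api/v1/targets') as response:
  active = json.loads(response.read().decode('utf-8'))['data']['activeTargets']

for target in active:
  print(target['labels']['instance'], target['health'])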

So at this point we've got Prometheus up and running, but it isn't yet collecting data about either the etcd cluster or the underlying machines. We'll do that next time.

Monitoring The Bottom Turtle (1/N)

We've set up an etcd cluster, but that's only part of the story. If you've gone to all the trouble of setting up an operationally robust keystore, that tends to imply an interest in operational hygiene, i.e. monitoring and alerting. So what's the best way to go about doing that, given that I've previously characterized the cluster as the "bottom turtle"?

The problem has an aspect of "quis custodiet ipsos custodes?". One option is to stand up an external monitoring server, but that leads to a mismatch in terms of robustness; the thing doing the monitoring is less highly-available than the thing being monitored. How do you make sure the monitoring server hasn't tipped over?

At a previous employer (the same one that had the data distribution infrastructure) we solved this problem by setting up clusters that monitored themselves. It turns out that if you've set up an N+2 cluster you already have a lot of the raw materials at hand; just have each node keep an eye on its peers. Done correctly, you end up with a monitoring solution that matches the robustness of the thing being monitored. Note that if you go this route you'll end up running multiple services on the same machine; if that makes you twitchy there's always Docker.

So what's the recommendation for monitoring etcd? The CoreOS people recommend Prometheus for data collection and Grafana for visualization; seeing no reason to second guess that choice we'll see if we can build something out using those tools. The goal is to build out a system which:

  • Allows each node to monitor its peers.
  • Presents a unified view of monitoring data.
We'll pick up there next time.

Thursday, March 28, 2019

Distributing Data, Consistently (4/N)

In the previous post we figured out how to get Python to talk to etcd. Now let's do something with it.

One of the strengths of the previously-mentioned company where I used to work is that they invested a lot in shared infrastructure. This included developing a bunch of standard APIs that just worked without having to think a whole lot about them. That's one of the goals for this exercise. So let's start by seeing if we can put together a pain-free service locator.

Here's a simple, dumb implementation:

import etcd

class ServiceFinder:
  def __init__(self):
    self.etcd_client = etcd.Client(
      host=(('turtle1', 2379), ('turtle2', 2379), ('turtle3', 2379)),
      allow_reconnect=True
    )

  def get_service(self, service):
    return self.etcd_client.get('/services/' + service).value
and some mucking about in the REPL:
>>> import service_finder
>>> sf = service_finder.ServiceFinder()
>>> sf.get_service('cdd')
'[ "turtle1", "turtle2", "turtle3" ]'

Look at how simple that is... import the module and ask it to tell you about the 'cdd' service.

"But wait, GG... you cheated. You hard-coded the etcd cluster addresses."

Yes, I did, but I wouldn't call it "cheating". Someone has to be the bottom turtle (see what I did there?). If you're building a system for service discovery then hardcoding the location of that system is perfectly justifiable. And in the real world you might do the hardcoding slightly more elegantly, such as through DNS SRV records. You can minimize the number of places where you have to hardcode things, but you can't eliminate them entirely.
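
For illustration, here's roughly what the SRV flavor might look like. This is a sketch of mine, not something from the setup above: it assumes the third-party dnspython package, and the domain and record name are illustrative (etcd's DNS discovery docs use the _etcd-client._tcp convention):

#!/usr/bin/python3.6

# Sketch: discover etcd client endpoints from DNS SRV records instead of
# hardcoding hostnames. Assumes the third-party dnspython package
# (pip3.6 install dnspython); the domain name is purely illustrative.
import dns.resolver

def etcd_hosts_from_srv(domain):
  answers = dns.resolver.query('_etcd-client._tcp.' + domain, 'SRV')
  return tuple((str(rr.target).rstrip('.'), rr.port) for rr in answers)

# The result can be fed straight to etcd.Client(host=..., allow_reconnect=True).
print(etcd_hosts_from_srv('example.com'))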

Can we make it a little fancier, maybe have it tell us when the key changes?

Yes, yes we can:

import etcd
import threading

class ServiceWatcher(threading.Thread):
  def __init__(self, service, callback):
    threading.Thread.__init__(self)
    self.service = service
    self.etcd_client = etcd.Client(
      host=(('turtle1', 2379), ('turtle2', 2379), ('turtle3', 2379)),
      allow_reconnect=True
    )
    self.callback = callback

  def run(self):
    for event in self.etcd_client.eternal_watch('/services/' + self.service):
      self.callback(event.value)

class ServiceFinder:
  def __init__(self):
    self.etcd_client = etcd.Client(
      host=(('turtle1', 2379), ('turtle2', 2379), ('turtle3', 2379)),
      allow_reconnect=True
    )

  def get_service(self, service):
    return self.etcd_client.get('/services/' + service).value

  def watch_service(self, service, callback):
    watcher_thread = ServiceWatcher(service, callback)
    watcher_thread.start()
The above file (service_finder.py) defines two classes, ServiceFinder and ServiceWatcher. ServiceFinder provides the interface that programs can call, and ServiceWatcher is a sub-class of threading.Thread ('cause that's how threads work in Python 3.x) which will run in the background and watch for changes to a designated service. Here's a trivial code example:
#!/usr/bin/python3.6

import service_finder
import time
import threading

global thread_lock

def service_change_callback(new_data):
  thread_lock.acquire()
  print('Key change: ' + new_data)
  thread_lock.release()

thread_lock = threading.Lock()
sf = service_finder.ServiceFinder()
sf.watch_service('cdd', service_change_callback)

while True:
  time.sleep(1)
  thread_lock.acquire()
  print("Critical Section!")
  thread_lock.release()
Again, trying to keep things as simple as possible. Note that ServiceWatcher doesn't show up at all; the program doesn't need to know the gory details of how the threading works under the hood. All anyone using the service_finder module needs to know is:
  • watch_service takes a callback which will magically get called whenever the specified service changes.
  • A modicum of locking is needed to prevent race conditions on the off chance that the callback is changing things that are used elsewhere (which is sort of the point of the callback in the first place).

Can it be made simpler? Can we hide the thread locking code somehow?

I don't think so. It would be simple enough to automagically wrap the callback in an acquire()/release() pair, but what about all the code that doesn't get touched by service_finder? No, the scope that invokes the service_finder module is the appropriate place to decide what the critical sections of code are. If I were writing production code the above would probably be turned into a module which could be invoked without worrying about the synchronization bits.
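
For what it's worth, the wrapping itself would look something like the sketch below; the objection above still stands, since it does nothing for code the lock is meant to protect elsewhere:

# Sketch of the "automagic" wrapping: hand ServiceWatcher a callback that
# does its own acquire()/release(). Harmless, but it can't know about any
# other critical sections in the calling program.
def locked_callback(lock, callback):
  def wrapper(new_data):
    with lock:  # acquire()/release() via the lock's context manager
      callback(new_data)
  return wrapper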

Anyway, there you go, that's the basic pattern. Once you've got that in place it becomes straightforward to perform all sorts of distributed operations in a safe and convenient manner.
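
As one example of those "distributed operations", a crude lock falls out of etcd's atomic create (set a key only if it doesn't already exist). Here's a sketch of mine built on python-etcd's prevExist flag; the key prefix and TTL are arbitrary choices:

import time

import etcd

class CrudeLock:
  # Sketch of a distributed lock, not production code. write() with
  # prevExist=False only succeeds if the key doesn't exist yet, and the TTL
  # makes the lock expire if its holder dies without releasing it.
  def __init__(self, client, name, ttl=30):
    self.client = client
    self.key = '/locks/' + name
    self.ttl = ttl

  def acquire(self):
    while True:
      try:
        self.client.write(self.key, 'held', prevExist=False, ttl=self.ttl)
        return
      except etcd.EtcdAlreadyExist:
        time.sleep(1)

  def release(self):
    self.client.delete(self.key)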

Distributing Data, Consistently (3/N)

Last time, we determined that setting up an etcd cluster really wasn't all that difficult. Now we'll actually try to do something with it.

First, let's try getting and setting some values using the etcdctl tool:

[vagrant@turtle1 ~]$ etcdctl set foo bar
bar
[vagrant@turtle1 ~]$ etcdctl get foo
bar
[vagrant@turtle1 ~]$ etcdctl set /services/cdd '[ "turtle1", "turtle2", "turtle3" ]'
[ "turtle1", "turtle2", "turtle3" ]
[vagrant@turtle1 ~]$ etcdctl get /services/cdd
[ "turtle1", "turtle2", "turtle3" ]
[vagrant@turtle1 ~]$ etcdctl ls /
/foo
/services
[vagrant@turtle1 ~]$ etcdctl ls /services/
/services/cdd
So far, so good. We can store and retrieve simple values as well as small JSON docs. And one of the nice built-ins of etcd is that it understands hierarchical keys automagically. Having determined that etcd is behaving as advertised from the command line, the next step is to try to do something programmatically.

etcd has a fucktonne of drivers for a wide variety of languages. Let's mess around with Python, since all the kids (cool or not) are using it these days. There's an embarrassment of riches here: no fewer than seven different driver packages. Here's a summary:

Driver                                        Supports V3   Supports Connection Pooling
https://github.com/kragniz/python-etcd3       X
https://github.com/jplana/python-etcd                       X
https://github.com/russellhaering/txetcd                    X
https://github.com/lisael/aioetcd/
https://github.com/crossbario/txaio-etcd      X
https://github.com/dims/etcd3-gateway         X
https://github.com/gaopeiliang/aioetcd3       X             X
I'd like a driver which supports both connection pooling and etcd V3, though I'm really more concerned with the former since we're trying to build HA systems. It doesn't do you a whole lot of good to have an HA keystore if you're only configured to talk to the node that's down.

aioetcd3 seems like the best fit, but after some experimentation I just couldn't get it to return data. I could see it talking to the cluster, but I wasn't able to get any data back at the programmatic level. This could be due to some subtle incompatibility between the driver and the version of etcd installed, or it could be the result of my complete unfamiliarity with Python's asynchronous IO system.

Next on the list is python-etcd, which I was able to get working without issue. Here's the installation:

yum install -y https://centos7.iuscommunity.org/ius-release.rpm
yum install -y python36u python36u-libs python36u-devel python36u-pip
pip3.6 install python-etcd
and here's some simple messing around via the REPL:
[vagrant@turtle1 ~]$ python3.6
Python 3.6.7 (default, Dec  5 2018, 15:02:05)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import etcd
>>> client = etcd.Client(host=(('turtle1', 2379), ('turtle2', 2379), ('turtle3', 2379)), allow_reconnect=True)
>>> client.members
{'aeb6950c050a83f3': {'id': 'aeb6950c050a83f3', 'name': 'turtle3', 'peerURLs': ['http://10.0.0.4:2380'], 'clientURLs': ['http://10.0.0.4:2379']}, 'ee274cacce804b21': {'id': 'ee274cacce804b21', 'name': 'turtle1', 'peerURLs': ['http://10.0.0.2:2380'], 'clientURLs': ['http://10.0.0.2:2379']}, 'fe9c64eaf3991d47': {'id': 'fe9c64eaf3991d47', 'name': 'turtle2', 'peerURLs': ['http://10.0.0.3:2380'], 'clientURLs': ['http://10.0.0.3:2379']}}
>>> client.get('/foo').value
'bar'
>>> client.get('/services/cdd').value
'[ "turtle1", "turtle2", "turtle3" ]'
>>> client.read('/services/cdd', wait = True).value
'[ "turtle1", "turtle2", "not a turtle" ]'
That last call blocked (as desired) until I updated the value via etcdctl.

So far, so good. We've got a working etcd cluster, and we're successfully able to talk to it via Python. Next up, let's write some real(-ish) code.

Distributing Data, Consistently (2/N)

In my previous post I jabbered a little bit about the utility of having a strongly-consistent data store that's highly-available and easy to work with from an operational standpoint. This, in turn, led us to the wonderful, magical land of consensus algorithms. After reviewing currently-available implementations we landed on etcd as the most viable candidate for further experimentation. What follows is not at all production-worthy, because it ignores the boring crap like tuning and security.

etcd is a KV store which addresses our needs as follows:

  • Strong consistency: Yes, via the Raft consensus algorithm.
  • HA: Supports shared-nothing clusters.
  • Operationally tractable: Requires no special hardware. Readily supports 5-node clusters, which gives us 3 for quorum, 1 broken, and 1 down for maintenance.
It also has a bunch of nice features under the hood which we'll get to at a later date.
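
As a quick sanity check on that last bullet, quorum is just a simple majority of the cluster:

# Quorum is a simple majority of the cluster; with 5 nodes that's 3, which
# leaves room for 1 broken node plus 1 down for maintenance.
def quorum(n):
  return n // 2 + 1

print(quorum(5))      # 3
print(5 - quorum(5))  # 2 nodes can be unavailable at once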

So, let's Vagrant ourselves up a 3-node cluster, just for the purposes of experimentation. Here's the bootstrap script:
#!/usr/bin/env bash
IP=$1
NAME=$2

yum update -y

# Set up toy name resolution
cat <<EOM >> /etc/hosts
10.0.0.2 turtle1
10.0.0.3 turtle2
10.0.0.4 turtle3
EOM

# Install etcd
yum install -y etcd

# Configure etcd for a three node cluster.
sed -i "
s/https/http/g
/ETCD_LISTEN_PEER_URLS/ s/localhost/$IP/
/ETCD_LISTEN_PEER_URLS/ s/^#//
/ETCD_LISTEN_CLIENT_URLS/ s|=\"|=\"http://$IP:2379,|
/ETCD_NAME/ s/default/$NAME/
/ETCD_INITIAL_ADVERTISE_PEER_URLS/ s/localhost/$IP/
/ETCD_INITIAL_ADVERTISE_PEER_URLS/ s/^#//
/ETCD_ADVERTISE_CLIENT_URLS/ s/localhost/$IP/
/ETCD_INITIAL_CLUSTER=/ s|=\".*$|=\"turtle1=http://10.0.0.2:2380,turtle2=http://10.0.0.3:2380,turtle3=http://10.0.0.4:2380\"|
/ETCD_INITIAL_CLUSTER/ s/^#//
" /etc/etcd/etcd.conf
Nothing fancy here. This statically configures the initial cluster membership, which is totally fine for the purposes of messing around. Etcd also has the ability to discover peers via DNS SRV records, which is probably what you'd do in a production scenario. Note also the use of HTTP rather than HTTPS for the sake of convenience; don't do this in prod. This is what the etcd configuration looks like after bootstrapping:
[vagrant@turtle1 ~]$ grep -v '^#' /etc/etcd/etcd.conf
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://10.0.0.2:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.0.0.2:2379,http://localhost:2379"
ETCD_NAME="turtle1"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.0.0.2:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.0.0.2:2379"
ETCD_INITIAL_CLUSTER="turtle1=http://10.0.0.2:2380,turtle2=http://10.0.0.3:2380,turtle3=http://10.0.0.4:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"

Here's the Vagrantfile:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"

  config.vm.define "turtle1" do |turtle1|
    turtle1.vm.hostname = "turtle1"
    turtle1.vm.network "private_network", ip: "10.0.0.2",
      virtualbox__intnet: true
    turtle1.vm.provision :shell, path: "bootstrap.sh", args: "10.0.0.2 turtle1"
  end

  config.vm.define "turtle2" do |turtle2|
    turtle2.vm.hostname = "turtle2"
    turtle2.vm.network "private_network", ip: "10.0.0.3",
      virtualbox__intnet: true
    turtle2.vm.provision :shell, path: "bootstrap.sh", args: "10.0.0.3 turtle2"
  end

  config.vm.define "turtle3" do |turtle3|
    turtle3.vm.hostname = "turtle3"
    turtle3.vm.network "private_network", ip: "10.0.0.4",
      virtualbox__intnet: true
    turtle3.vm.provision :shell, path: "bootstrap.sh", args: "10.0.0.4 turtle3"
  end
end
Again, nothing fancy. Set up three VMs on a (VirtualBox internal) private network and bootstrap them per the script above. Note that there are no instructions to start etcd. etcd is happiest when all nodes in the cluster are up and running before any daemons get started, which means that all three instances should be started after turtle3 is completely provisioned. I couldn't figure out how to do this automatically via Vagrant, so we're left with the following manual invocation:
for host in turtle1 turtle2 turtle3; do
  vagrant ssh ${host} -- 'sudo service etcd start' > ${host}.log 2>&1 &
done;

If all goes well you should now have a functional etcd cluster:

[vagrant@turtle1 ~]$ etcdctl cluster-health
member aeb6950c050a83f3 is healthy: got healthy result from http://10.0.0.4:2379
member ee274cacce804b21 is healthy: got healthy result from http://10.0.0.2:2379
member fe9c64eaf3991d47 is healthy: got healthy result from http://10.0.0.3:2379
cluster is healthy

W00t. In our next episode we'll look at actually doing some stuff.

Distributing Data, Consistently (1/N)

A long time ago, in a galaxy far, far away, I worked for a company that had this really cool system for distributing data in a strongly consistent manner. Having such a system is useful for all sorts of things, like distributed locks, and leader election, and dynamic service configuration. Lacking such a system where I currently work I decided to see if I could cobble something similar together.

So, what sort of properties should such a system have? It needs to be:

  • Strongly consistent
  • Highly available (HA)
  • Operationally tractable

"Strongly consistent" tends to point towards ACID databases (that's what the 'C' is for, after all). However, most readily-available ACID DBs tend aren't so hot when it comes to the second and third points. Many of them are active/passive, which is sorta-HA-but-we-can-do-better. The ones which are active-active typically need to share fancy-ass hardware (like a SAN) to do their thing. Shared hardware is expensive, and simply isn't available in oh... say... a cloud deployment. The holy grail is an active-active, shared-nothing system.

Also, what did I mean by "operationally tractable"? The system has to be operationally serviceable in the real world. Specifically, it should support an N+2 deployment. Why N+2? Well:

  • N nodes are doing their job.
  • 1 node just failed.
  • 1 node is undergoing routine maintenance.

Taking all three criteria into account (strongly-consistent, HA, and operationally realistic) really limits your options, and points to distributed consensus algorithms. Our candidates are:

  • Paxos
  • Raft
  • Chandra-Toueg

A list of general-purpose data storage things that use any of the above for strong consistency:

  • Various bits of Google infrastructure (Chubby, Spanner): Not generally available for public consumption.
  • Doozerd: Tailored to the use case, but the last commit was 2013 and I, not being a Go expert, wasn't able to get it to build.
  • Etcd: Looks similar to Doozerd, but is in active development.
  • neo4j: Does something called "causal clustering", which I'm not convinced is what I'm looking for. Also, it's a graph DB, so we're into square-peg-round-hole territory.
  • Clustrix: Commercial, not readily available.
  • Riak KV: Readily available, and has good docs. However, there's a big-ass warning that the strong consensus code is experimental.
Etcd looks like the best place to start; I'll pick up there next time.
