Friday, March 29, 2019

Monitoring The Bottom Turtle (1/N)

We've set up an etcd cluster, but that's only part of the story. If you've gone through all the trouble of setting up an operationally robust keystore, that tends to imply an interest in operational hygiene, i.e. monitoring and alerting. So what's the best way to go about that, given that I've previously characterized the cluster as the "bottom turtle"?

The problem has an aspect of "quis custodiet ipsos custodes?". One option is to stand up an external monitoring server, but that leads to a mismatch in terms of robustness; the thing doing the monitoring is less highly-available than the thing being monitored. How do you make sure the monitoring server hasn't tipped over?

At a previous employer (the same one that had the data distribution infrastructure) we solved this problem by setting up clusters that monitored themselves. It turns out that if you've set up an N+2 cluster you already have a lot of the raw materials at hand; just have each node keep an eye on its peers. Done correctly, you end up with a monitoring solution that matches the robustness of the thing being monitored. Note that if you go this route you'll end up running multiple services on the same machine; if that makes you twitchy there's always Docker.
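
Jumping ahead slightly to the tooling choice discussed below, here's a minimal sketch of what co-locating a monitoring agent with the local etcd member might look like under Docker Compose; the image tag, mount paths, and port are purely illustrative, not a recommendation:

    # docker-compose.yml (sketch) - runs a Prometheus container alongside
    # the host's etcd member.  Pin an image version in real life.
    version: "3"
    services:
      prometheus:
        image: prom/prometheus
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
        ports:
          - "9090:9090"    # Prometheus web UI / API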

So what's the recommendation for monitoring etcd? The CoreOS people recommend Prometheus for data collection and Grafana for visualization; seeing no reason to second-guess that choice, we'll see what we can build using those tools (a first sketch follows the list below). The goal is a system which:

  • Allows each node to monitor its peers.
  • Presents a unified view of monitoring data.
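
As a first pass at the "monitor your peers" goal, here's a minimal sketch of the Prometheus scrape config each node might run; the hostnames are hypothetical, and if etcd's client port is serving TLS you'd also need scheme/tls_config entries:

    # prometheus.yml (sketch) - identical on every node, so each member
    # scrapes all of its peers as well as itself.  etcd serves /metrics on
    # its client port (2379) unless --listen-metrics-urls points elsewhere.
    scrape_configs:
      - job_name: etcd
        static_configs:
          - targets:
              - etcd-0.example.com:2379
              - etcd-1.example.com:2379
              - etcd-2.example.com:2379

Since every node runs the same config, losing any single node still leaves the surviving members watching each other, which is the robustness property we were after.
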
We'll pick up there next time.
