Shiny Ideas: Bonus Round: HA Vault On Top of Etcd!

When we left off we'd just finished converting the etcd cluster to speak TLS. What we didn't get around to, however, was taking the Vault implementation from a single, standalone server to an HA implementation. Let's finish that bit of bootstrapping now.

In order to run an HA vault you need to select a storage backend that supports HA operations. Luckily for us, one of the available backends is etcd. Getting that configured is straightforward:

[root@turtle1 ~]# cat /etc/vault/vault_etcd.hcl
storage "etcd" {
  address = "https://turtle1.localdomain:2379,https://turtle2.localdomain:2379,https://turtle3.localdomain:2379"
  ha_enabled = "true"
}
listener "tcp" {
  address     = "turtle1.localdomain:8200"
  tls_cert_file = "/etc/pki/tls/certs/turtle1.localdomain.crt"
  tls_key_file = "/etc/pki/tls/private/turtle1.localdomain.key"
  tls_disable_client_certs = 1
}

api_addr    = "https://turtle1.localdomain:8200"
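
The file above is specific to turtle1: the listener address, the cert and key paths, and api_addr all carry that node's hostname. One way to avoid hand-editing three copies is to render the file from a template. This is only a minimal Python sketch. It assumes that turtle2 and turtle3 follow the same cert/key naming convention as turtle1 and that every etcd member serves clients on port 2379; render_vault_config is a hypothetical helper, not something Vault ships.

```python
# Assumption: every etcd member serves clients on 2379.
ETCD_PORT = 2379
# Assumption: these are the three nodes, named as in the config above.
NODES = ["turtle1.localdomain", "turtle2.localdomain", "turtle3.localdomain"]

# Same layout as vault_etcd.hcl above; doubled braces are literal braces.
TEMPLATE = """storage "etcd" {{
  address = "{etcd}"
  ha_enabled = "true"
}}
listener "tcp" {{
  address     = "{host}:8200"
  tls_cert_file = "/etc/pki/tls/certs/{host}.crt"
  tls_key_file = "/etc/pki/tls/private/{host}.key"
  tls_disable_client_certs = 1
}}

api_addr    = "https://{host}:8200"
"""


def render_vault_config(host, nodes=NODES):
    """Return the config text for one Vault node (hypothetical helper)."""
    etcd = ",".join(f"https://{n}:{ETCD_PORT}" for n in nodes)
    return TEMPLATE.format(etcd=etcd, host=host)


if __name__ == "__main__":
    print(render_vault_config("turtle2.localdomain"))
```

Only the hostname varies between nodes; the etcd address list is identical everywhere, which is the whole point of letting the storage backend handle coordination.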

The storage stanza, which previously pointed to local filesystem storage, is now pointing to the etcd cluster. That's all there is to it.

"But wait!", you may be saying if you've spent a lot of time in the ops trenches, "I didn't tell Vault about its peers". That's right, you didn't, and you don't need to; Vault is going to take care of all of that for you via its HA storage system. So just start up your Vault nodes using the config file above, swapping in each node's own hostname and cert paths.

Done? Good. Ok, let's check the vault status:

[vagrant@turtle1 ~]$ export VAULT_ADDR=https://turtle1.localdomain:8200
[vagrant@turtle1 ~]$ vault status
Key                Value
---                -----
Seal Type          shamir
Initialized        false
Sealed             true
Total Shares       0
Threshold          0
Unseal Progress    0/0
Unseal Nonce       n/a
Version            n/a
HA Enabled         true

Note that "HA Enabled" is now "true", which tells you that the HA subsystem is working correctly. Note also that, because we switched from file storage to etcd storage, the cluster is uninitialized. So let's init and unseal as per usual...

[vagrant@turtle1 ~]$ vault operator init
...
[vagrant@turtle1 ~]$ vault operator unseal
...
[vagrant@turtle1 ~]$ vault operator unseal
...
[vagrant@turtle1 ~]$ vault operator unseal
...

If everything has been set up correctly you should see something like the following log messages once the first node is unsealed:

2019-04-05T02:15:59.862Z [INFO]  core: vault is unsealed
2019-04-05T02:15:59.862Z [INFO]  core: entering standby mode
2019-04-05T02:15:59.862Z [INFO]  core.cluster-listener: starting listener: listener_address=10.0.0.2:8201
2019-04-05T02:15:59.862Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=10.0.0.2:8201
2019-04-05T02:15:59.882Z [INFO]  core: acquired lock, enabling active operation

That bit about acquiring the lock means that the node (as expected) has taken the "active" role. Vault uses an active/standby architecture consisting of 1 active node and multiple standbys that can take over if the active node dies. The nice thing about how that's been implemented is that you can conduct operations against any node; the standby nodes will simply forward your request on to the active node.
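
If you'd rather ask a node which role it holds than read its logs, the /v1/sys/health endpoint encodes the answer in its HTTP status code. With default settings, 200 means active, 429 means unsealed standby, 501 means not initialized, and 503 means sealed. Here's a minimal Python sketch, assuming the node's certificate chains to a CA your local trust store already knows about; describe_health and node_role are hypothetical names.

```python
import urllib.error
import urllib.request

# Default status codes returned by /v1/sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    501: "uninitialized",
    503: "sealed",
}


def describe_health(status_code):
    """Map a sys/health status code to a short role/state label."""
    return HEALTH_STATES.get(status_code, f"unknown ({status_code})")


def node_role(vault_addr):
    """Query one node, e.g. node_role("https://turtle1.localdomain:8200")."""
    try:
        with urllib.request.urlopen(f"{vault_addr}/v1/sys/health") as resp:
            return describe_health(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx codes, which is where standby,
        # uninitialized and sealed nodes end up.
        return describe_health(err.code)
```

Run node_role against each of the three turtles and you should get one "active" and, once the others are unsealed, two "standby".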

Note that a cluster node has to be unsealed before it can join the cluster; otherwise it won't be able to read the synchronization data that's stored in etcd. So, let's unseal another node:

[vagrant@turtle2 ~]$ vault operator unseal
...
[vagrant@turtle2 ~]$ vault operator unseal
...
[vagrant@turtle2 ~]$ vault operator unseal
Unseal Key (will be hidden):
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.1.0
Cluster Name           vault-cluster-30295974
Cluster ID             e0e74985-2d25-c4bb-6645-7ffa1327ba9c
HA Enabled             true
HA Cluster             https://turtle1.localdomain:8201
HA Mode                standby
Active Node Address    https://turtle1.localdomain:8200

W00t! The second node can read the cluster synchronization data. Unsealing the third node will give you a similar result.
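
If typing key shares into three nodes by hand gets old, the same operation is available over the HTTP API: each unseal step is a PUT to /v1/sys/unseal carrying one key share, and the response reports whether the node is still sealed. The following is a rough Python sketch, not a hardened tool. It assumes the node's certificate is trusted by the local trust store, and unseal and submit_unseal are hypothetical helper names. The key shares are whatever your init step printed.

```python
import json
import urllib.request


def submit_unseal(vault_addr, body):
    """PUT one key share to /v1/sys/unseal and return the parsed status."""
    req = urllib.request.Request(
        f"{vault_addr}/v1/sys/unseal",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def unseal(vault_addr, key_shares, submit=submit_unseal):
    """Feed key shares to one node until it reports sealed == False.

    `submit` is injectable so the loop can be exercised without a server.
    """
    status = {"sealed": True}
    for share in key_shares:
        body = json.dumps({"key": share}).encode()
        status = submit(vault_addr, body)
        if not status["sealed"]:
            break  # threshold reached; remaining shares are not needed
    return status
```

Keep in mind that anything that holds the key shares in one place weakens the point of splitting them, so treat this as a lab convenience.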

So, does the HA work? Let's find out! "^C" the server on turtle1, and see what shows up in the log for turtle2:

2019-04-05T02:16:42.457Z [INFO]  core: acquired lock, enabling active operation
2019-04-05T02:16:42.510Z [INFO]  core: post-unseal setup starting
2019-04-05T02:16:42.514Z [INFO]  core: loaded wrapping token key
2019-04-05T02:16:42.514Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
2019-04-05T02:16:42.523Z [INFO]  core: successfully mounted backend: type=system path=sys/
2019-04-05T02:16:42.523Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2019-04-05T02:16:42.523Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2019-04-05T02:16:42.553Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2019-04-05T02:16:42.553Z [INFO]  core: restoring leases
2019-04-05T02:16:42.553Z [INFO]  rollback: starting rollback manager
2019-04-05T02:16:42.557Z [INFO]  expiration: lease restore complete
2019-04-05T02:16:42.564Z [INFO]  identity: entities restored
2019-04-05T02:16:42.568Z [INFO]  identity: groups restored
2019-04-05T02:16:42.571Z [INFO]  core: post-unseal setup complete

turtle2 took over the "active" role pretty much instantaneously.
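
To put a number on "pretty much instantaneously", you can poll the surviving nodes until one of them reports itself active and time how long that takes. A small Python sketch follows. The probe argument is a hypothetical callable that takes a node address and returns a role string such as "active" or "standby" (for example, a wrapper around /v1/sys/health); it is an assumption, not a Vault API.

```python
import time


def wait_for_active(nodes, probe, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `nodes` until one is active; return (node, seconds_waited).

    Returns (None, seconds_waited) if nothing became active in time.
    """
    start = clock()
    while clock() - start < timeout:
        for node in nodes:
            try:
                if probe(node) == "active":
                    return node, clock() - start
            except OSError:
                # Connection refused, timeout, etc.: the node is down.
                continue
        sleep(interval)
    return None, clock() - start
```

Start it just before you "^C" the active server; the elapsed time is bounded below by the polling interval, so shrink interval if you want a finer measurement.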

The last thing we really want to do is import the root cert we generated previously. The Vault API provides a mechanism for importing CA material. All we need is the cert that we generated and its associated private key.

Well, crap... it turns out that there's no way to get the private key for the CA after the fact. So, one revision that I'd make to the Vault bootstrapping process is to use the exported version of the call to generate the CA (root/generate/exported rather than root/generate/internal), which returns the private key exactly once in the response, and then stuff the key somewhere that it could be imported into the HA cluster.
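
For the revised process, the import side wants a single PEM bundle containing both the CA cert and its key, posted as pem_bundle to the PKI engine's config/ca endpoint. A minimal Python sketch of gluing that bundle together from an exported-generate response is below. It assumes the response has the usual shape, with certificate and private_key under data, and pem_bundle and import_request_body are hypothetical helper names.

```python
def pem_bundle(generate_response):
    """Concatenate the CA private key and certificate into one PEM bundle.

    `generate_response` is the parsed JSON from a root/generate/exported
    call; this only works if the key was exported at generation time.
    """
    data = generate_response["data"]
    parts = [data["private_key"].strip(), data["certificate"].strip()]
    return "\n".join(parts) + "\n"


def import_request_body(generate_response):
    """Build the JSON body for the config/ca import call."""
    return {"pem_bundle": pem_bundle(generate_response)}
```

Wherever the bundle lives between export and import, it is the crown jewels of the whole setup, so store it accordingly.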

This brings us to the end of a rather long set of posts. Let's review:

A review of off-the-shelf systems that might be used to consistently distribute data.
Setup of an HA etcd cluster, sans TLS.
Selection of an appropriate Python driver for working with an HA etcd cluster.
A trivial example of a program which can dynamically update itself in response to changes in etcd keys.
A series of posts (1, 2, 3, 4) on monitoring the cluster using Prometheus and Grafana.
Creating a bootstrap Vault instance to use as an initial certificate authority.
Using the bootstrap instance to generate an initial CA and host cert, and then using the cert to enable TLS communication with Vault.
And, finally, generating all the certs and configuring the etcd cluster for TLS operation.

The end state is that we've gone from zero to an HA Vault instance running on top of TLS-secured etcd which is capable of self-monitoring. This is close(-ish) to something that you could use as a root-of-trust in a production environment, though you'd still want to go through hardening, figure out an appropriate auth scheme, and establish a backup/disaster recovery procedure.

Saturday, April 13, 2019
