About 6 months ago (in a galaxy pretty close to our office) …
Our old hosting provider was having network issues… again. There had been a network split around 3:20 AM, which had caused a few of our worker servers to become disconnected from the rest of our network. The background jobs on those workers kept trying to reach our other services until their timeout was reached, and they gave up.
This had already been the second incident in that month. Earlier, a few of our servers had been rebooted without warning. We were lucky that these servers were part of a cluster that could handle suddenly failing workers gracefully. We had taken care that rebooted servers would start up all their services in the right order and would rejoin the cluster without manual intervention.
However, if we would have been unlucky, and e.g. our main database server would have been restarted without warning, then we would have had some downtime and, potentially, would have had to manually fail over to our secondary database server.
We kept joking about how the flakiness of our current hosting provider was a great “Chaos Monkey”-like service which forced us to make sure that we had proper retry-policies and service start-up sequences in place everywhere.
But there were also other issues: booting up new machines was a slow and manual process, with few possibilities for automation. The small maximum machine size also started to become an inconvenience, and, lastly, they only had datacenters in the Netherlands, while we kept growing internationally.
It was clear that we needed to do something about the situation.
Which cloud to go to?
Our requirements for a new hosting provider made it pretty clear that we would have to move to one of the three big cloud providers if we wanted to fulfill all of them. One of the important things for us was an improved DevOps experience that would allow us to move faster. We needed to be able to spin up new boxes with the right image in seconds. We needed a fully private network that we could configure dynamically. We needed to be flexible in both storage and compute options and be able to scale both of them up and down as necessary. Additional hosted services (e.g. log aggregation and alerting) would also be nice to have. But, most importantly, we needed to be able to control and automate all of this with a nice API.
We had already been using Google Cloud Storage (GCS) in the past and were very content with it. The main reason for us to go with GCS had been the possibility to configure it to be strongly consistent, which made things easier for us. Therefore, we had a slight bias towards Google Cloud Platform (GCP) from the start but still decided to evaluate AWS and Azure for our use case.
Azure fell out pretty quickly. It just seemed too rough around the edges and some of us had used it for other projects and could report that they had cut their fingers on one thing or another. With AWS, the case was different, since it has everything and the kitchen sink. A technical problem was the lack of true strong consistency for S3. While it does provide read-after-write consistency for new files, it only provides eventual consistency for overwrite PUTs and for DELETEs.
Another issue was the price-performance ratio: for our workload, it looked like AWS was going to be at least two times more expensive as GCP for the same performance. While there are a lot of tricks one can use to get a lower AWS bill, they are all rather complex and either require you to get into the business of speculating on spot instances or to commit for a long time to specific instances, which are both things we would rather avoid doing. With GCP, the pricing is very straightforward: you pay a certain base price per instance per month, and you get a discount on that price of up to 80% for sustained use. In practice: If you run an instance 24/7, you end up paying less than half of the “regular” price.
Given that Google also offers great networking options, has a well-designed API with an accompanying command-line client, and has datacenters all over the world, the choice was simple: we would be moving to GCP.
How do we get there?
After the decision had been taken, the next task was to figure out how we would move all of our data and all of our services to GCP. This would be a major undertaking and require careful preparation, testing, and execution. It was clear that the only viable path would be a gradual migration of one service after another. The “big bang” migration is something we had stopped doing a long time ago after realizing that, even with only a handful of services and a lot of preparation and testing, this is very hard to get right. Additionally, there is often no easy path to rollback after you pulled the trigger, leading to frantic fire-fighting and stressed engineers.
The requirements for the migration were thus as follows:
- as little downtime as possible
- possibility to gradually move one service after the other
- testing of individual services as well as integration tests of several services
- clear rollback path for each service
- continuous data migration
This daunting list had a few implications:
- We would need to be able to securely communicate between the old and the new datacenter (let’s call them
- The latency and the throughput between the two would need to be good enough that we could serve frontend requests across datacenters
- Internal DNS names needed to be resolved between datacenters (and there could be no DNS name clashes)
- And, we would have to come up with a way to continuously sync data between the two until we were ready to pull the switch
A plan emerges
After mulling this over for a bit, we started to have a good idea how to go about it. One of the key ingredients would be a VPN that would span both datacenters. The other would be proper separation of services on the DNS level.
On the VPN side, we wanted to have one big logical network where every service could talk to every other service as if they were in the same datacenter. Additionally, it would be nice if we wouldn’t have to route all traffic through the VPN. If two servers were in the same datacenter, it would be better if they could talk to each other directly through the local network.
Given that we don’t usually spend all day configuring networks, we had to do some research first to find the best solution. We talked to another startup that was using a similar setup, and they were relying on heavy-duty networking hardware that had built-in VPN capabilities. While this was working really well for them, it was not really an option for us. We had always been renting all of our hardware and had no intention of changing that. We would have to go with a software solution.
The first thing we looked at was OpenVPN. It’s the most popular open-source VPN solution, and it has been around for a long time. We had even been using it for our office network for a while and had some experience with it. However, our experience had not been particularly great. It had been a pain to configure and getting new machines online was more of a hassle than it should have been. There were also some connectivity issues sometimes where we would have to restart the service to fix the problem.
We started looking for alternatives and quickly stumbled upon zerotier.com, a small startup that had set out to make using VPNs user-friendly and simple. We took their software for a test ride and came away impressed: it literally took 10 minutes to connect two machines, and it did not require us to juggle certificates ourselves. In fact, the software is open-source and they do provide signed DEB and RPM packages on their site.
The best part of ZeroTier, however, is its peer-to-peer architecture: nodes in the network talk directly to each other instead of through some central server and we measured very low latencies and high throughput due to it. This was another concern that we had had with OpenVPN, since the gateway server could have become a bottleneck between the two datacenters. The only caveat about ZT is that it requires a central server for the initial connection to a new server, all traffic after that initial handshake is peer-to-peer.
With the VPN in-place, we needed to take care of the DNS and service discovery piece next. Fortunately, this one was easy: we had been using Hashicorp’s Consul pretty much from the beginning and knew that it had multi-datacenter capabilities. We only needed to find out how to combine the two.
The dream team: Consul and ZeroTier
Getting ZeroTier up and running was really easy:
- First install the
zerotier-oneservice via apt on each server (automate this with your tool of choice).
- Then, issue
sudo zerotier-cli join the_network_idonce to join the VPN.
- Finally, you have to authorize each server in the ZT web interface by checking a box (this step can also be automated via their API, but this was not worth the effort for us).
This will create a new virtual network interface on each server:
The IP address will be assigned automatically a few seconds after authorizing the server. Each server then has two network interfaces, the default one (e.g.
ens4) and the ZT one, called
zt0. They will be in different subnets, e.g.
10.144.x.x, where the first one is the private network inside of the Google datacenter and the second is the virtual private network created by ZT, which spans across both
dc2. At this point, each server in
dc1 is able to ping each server in
dc2 on their ZT interface.
It would be possible to run all traffic over the ZT network, but, for two servers that are anyway in the same datacenter, this would be a bit wasteful due to the (small) overhead introduced by ZT. We, therefore, looked for a way to advertise a different IP address depending on who was asking. For cross-datacenter DNS requests, we wanted to resolve to the ZT IP address, and, for in-datacenter DNS requests, we wanted to resolve to the local network interface.
The good news here is that Consul supports this out-of-the-box! Consul works with JSON configuration files for each node and service. An example of the config for a node is the following:
Consul relies on the
datacenter to be set correctly if it is used for both LAN and WAN requests. The other important flags here are:
advertise_addrthe address to advertise over LAN (the local one in our case)
advertise_addr_wanthe address to advertise over WAN (ZT in our case)
translate_wan_addrsenable to return the WAN address for nodes in a remote datacenter
bind_addrmake sure this is
0.0.0.0(which is the default) so that consul listens on all interfaces
After applying this setup to all nodes in each datacenter, you should now be able to reach each node and service across datacenters. You can test this by e.g. doing
dig node_name.node.dc1.consul once from a machine in
dc1 and once from a machine in
dc2, and they should then respond with the local and with the ZT addresses respectively.
Given this setup, it is then possible to switch from a service in one datacenter to the same service in another datacenter simply by changing its DNS configuration.
Issues we ran into
As with all big projects like this, we ran into a few issues of course:
We encountered a Linux kernel bug that prevented ZT from working. It was easily fixed by upgrading to the latest kernel.
We are using Hashicorp’s Vault for secret management. See our other blogpost for a more in-depth explanation of how we use it. In order to make
vaultwork nicely with ZT we needed to set its
redirect_addrto the consul hostname of the server it is running on, e.g.
redirect_addr = "http://the_hostname.node.dc1.consul:8501". Vault advertises its redirect address in its Consul service definition by default. And this defaults to the private IP in the datacenter it was running in. Setting the
redirect_addrto the Consul hostname ensures that consul resolves to the right address. Debugging this issue was quite the journey and required diving into the source of both Consul and Vault.
Another issue we ran into was that Dnsmasq is not installed by default on GCE Ubuntu images. We rely on Dnsmasq to relay
*.consuldomain names to Consul. It can easily be installed via
Moving the data
While a lot of our services are stateless and could therefore easily be moved, we naturally also need to store our data somewhere and, therefore, had to come up with a plan to migrate it to its new home.
Our main datastores are Postgres, HDFS, and Redis. Each one of these needed a different approach in order to minimize any potential downtime. The migration path for Postgres was straightforward: Using
pg_basebackup, we could simply add another hot-standby server in the new datacenter, which would continously sync the data from the master until we were ready to pull the switch. Before the critical moment we turned on
synchronous_commit to make sure that there was no replication lag and then failed over using the trigger file mechanism that Postgres provides. This technique is also convenient if you need to upgrade your DB server, or if you need to do some maintenance, e.g. apply security updates and reboot.
For HDFS the approach was different: Due to the nature of our application, we refresh all data on it at least every 24 hours. This made it possible to simply upload all of the data to two clusters in parallel and to keep them synced as well. Having the data on the new and the old cluster allowed us to run a number of integration tests that ensured that the old and the new system would return the same results. For a while, we would submit the same jobs to both clusters and compare the results. The result from the new cluster would be discarded, but, if there was a difference, we would send an alert that would allow us to investigate the difference and fix the problem. This kind of “A/B-testing” was an invaluable help that helped ironing out any unforeseen issues before switching over in production.
We use Redis mainly for background jobs, and we have support for pausing jobs temporarily in Jobmachine, our job scheduling system. This made the Redis move easy: We could pause jobs, sync the Redis data to disk,
scp the data over to the new server, run a few integrity tests, update DNS, and then resume processing jobs.
The key in migrating our data was again to do each service individually, validate the data, test the services relying on it, and then switching over once we were sure everything was working correctly.
The issues and limitations of our old hosting provider made it necessary to look for an alternative. It was important for us that we could move all of our services and data gradually and could test and validate each step of the migration. We therefore chose to create a VPN that would span both of our datacenters using ZeroTier. In combination with Consul, this allowed us to have two instances of each service, which we could easily switch between using only a DNS update. For the data migration we made sure to duplicate all data continuously until we were sure everything was working as intended. If you are looking for an easy way to migrate from one datacenter to another, then we can highly recommend looking into both Consul and ZeroTier.