Over the past six months we moved all of our infrastructure from virtual machines and bare metal to Google Cloud Platform. We moved many of our systems to Docker and Kubernetes on Google Container Engine (GKE), and we migrated our databases from bare metal to virtual machines in Google Compute Engine (GCE). The entire effort took about 6 weeks of work and 3 months of stabilization. In the end we have mostly better performance (databases are a bit slower), much better scaling options, and lower costs, but less stability (we’re working on it; plus resiliency).
I’ve gathered some notes and am doing a presentation on this in the coming week. I’d like to expand on these thoughts some more and will link to future related posts as I complete them. Below are some of the more salient points of the adventure.
Bare Metal to Cloud Migration
We moved our bare metal database servers to GCE using Cloud Endure. In short, this worked well and fairly quickly; it was easy to test before the production cut-over. It did leave the systems with non-standard configurations (e.g. no default GCP application account installed) but we could easily redeploy new instances with the snapshots. In doing so we moved to a new availability zone and used updated settings. You can see more details in my review of the Cloud Endure migration.
Years ago, we chose bare metal for performance. We needed direct connectivity of the drives to the running machines in order to get the performance we needed for our databases. In early 2014 we tested a migration to Amazon RDS and, despite allocating high-IOPS storage, we found performance to be abysmal: about six times slower than the bare metal we had been using. GCE claims to have outstanding SSD persistent disk performance from VMs. In practice we’ve found it to be about 30% slower than the bare metal, locally attached SSDs we were using.
That said, scaling is much simpler: I increased storage, CPU, and memory with only a one-hour maintenance shutdown instead of twelve hours. Additionally, to upgrade drives under bare metal we had a 3-week lead time to order hardware, and that sometimes involved a long-term commitment. The move to GCE gives me a lot of flexibility, and we’re continuing to eke out the best performance.
Docker and Kubernetes and GKE
We moved our product from VMs hosted with Blue Box to GKE by converting the web applications, background workers (Resque), and redis to Docker containers. I was not initially interested in using Docker (I’m more inclined to stable products, i.e. “boring tech”), but I realized that Kubernetes was an advantage for our background workers. I can use it to manage and scale them (with the same tools I use to scale our web applications), removing a lot of custom monit configuration and scripts, although that’s likely a net-zero change in lines of code.
Docker has been mostly stable for us, despite rumblings of mutiny in the community about breaking changes. My only real complaint was that there were a couple releases in the 1.13 series on the beta channel with significant bugs (like one release that could not parse Dockerfiles). I ended up rolling back my development environment to the stable channel, which is fine as that is what GKE has. I had set up the beta channel because I needed features of the 1.12 series before it was released.
Docker builds, on the other hand, are dog slow and the images are ridiculously large. I think that the aufs file system, as cool as it might seem, doesn’t really help in the end because I get images that are over 1GB when the final file system is only 300MB. This happens because aufs stores the changes at each layer. Even if the change to the file is just the permissions, the entire file gets duplicated. We spent weeks tuning our Dockerfiles with all kinds of “tricks” (read “hacks”) to get as much as possible into cached layers and to keep rarely changed pieces cached. The rule of thumb is to touch as little as possible when changing an image. For example, rather than modifying ownership for an entire directory of 300MB, we were able to slim the image down by changing only a couple of subfolders that happened to be empty. This was the better decision in any case because we didn’t really want all those files to be writable by the user running the service. Another important lesson: use the .dockerignore file for keeping logs, temporary files, and secrets out of the images.
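To make the size arithmetic concrete, here is a minimal sketch in Python that models an image as a list of layers, where each layer records every file it wrote. It is a simplified model of layered storage, not how Docker or aufs is implemented, and the file names and sizes are made up for illustration (the 300MB figure echoes the example above).

```python
def image_size(layers):
    """Stored size of the image: every file written in every layer counts."""
    return sum(size for layer in layers for size in layer.values())


def final_fs_size(layers):
    """Size of the merged file system: the latest version of each path wins."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return sum(merged.values())


# Hypothetical contents, sizes in MB.
app_layer = {"/app/lib.bin": 300, "/app/tmp": 0, "/app/log": 0}

# Changing ownership of the whole directory rewrites every file into a new layer.
chown_everything = [app_layer, dict(app_layer)]

# Changing ownership of only the two empty subfolders rewrites almost nothing.
chown_subfolders = [app_layer, {"/app/tmp": 0, "/app/log": 0}]

for name, layers in [("whole directory", chown_everything),
                     ("empty subfolders", chown_subfolders)]:
    print(name, image_size(layers), "MB stored,",
          final_fs_size(layers), "MB visible")
```

In this model both variants end up with the same visible file system, but the first stores the 300MB file twice.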
Kubernetes, on the other hand, is our Chaos Monkey. It’s extremely powerful and much more complex than it appears to be at first glance (which is a compliment). We have spent most of four months tuning it and trying to address stability issues, mostly related to out-of-memory conditions. Pro tip: set a memory limit on every container. Kubernetes doesn’t stop your container from trying to allocate more memory than the node has available. This can cause the node to kill other containers or processes such as kubelet running on the node, which leaves the node inaccessible and un-inspectable. The upshot of this is that we’re learning how to build much more resilient systems.
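As an illustration of the memory-limit tip, the following Python sketch builds a Pod manifest with a memory request and limit on its container and prints it as JSON, which kubectl accepts as well as YAML. The pod name, image name, and the 256/512 MiB values are placeholders, not our actual settings.

```python
import json


def container_with_memory(name, image, request_mib, limit_mib):
    """Container spec with an explicit memory request and limit (in MiB)."""
    if request_mib > limit_mib:
        # Kubernetes rejects a request that is larger than the limit.
        raise ValueError("memory request must not exceed the limit")
    return {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"memory": f"{request_mib}Mi"},
            "limits": {"memory": f"{limit_mib}Mi"},
        },
    }


def pod_manifest(pod_name, containers):
    """Minimal v1 Pod manifest wrapping the given container specs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {"containers": containers},
    }


# Placeholder names and sizes, for illustration only.
manifest = pod_manifest(
    "resque-worker",
    [container_with_memory("worker", "example/worker:latest", 256, 512)],
)
print(json.dumps(manifest, indent=2))
```

With a limit in place, a container that grows past it is the one that gets killed, rather than an unrelated process on the node.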
Redis doesn’t run very well in Kubernetes as a single master. Redis in a stable environment basically runs like bedrock. In the last five years I’ve had redis crash maybe twice, and always related to out-of-memory on the VM it was running on. Kubernetes assumes that every container is ephemeral, so we had redis crashing daily on GKE. This is somewhat ironic as redis appears to be the canonical “hello world” example in a lot of the Kubernetes documentation for running a service. The clear answer, as the examples show, is to run redis as a cluster, which means, at a minimum, two database instances and three sentinel instances. I want my redis databases segregated, and that means a lot of running pieces if each needs five containers (plus the additional resources for duplicate database storage). We’re still working on this, but for the short term we’ve stabilized it by strictly controlling memory requests and limits for almost all of our containers.
We did end up implementing autoscaling for some of our background workers using Kubernetes Jobs and GKE cluster group autoscaling. This is something we could have done previously on the VM environment, but had back-burnered. It became critical after launching to GKE (the timing was unrelated to the migration). Using Kubernetes for the autoscaling, however, means that our solution is portable to another provider (running Kubernetes), which a VM-specific (i.e. cloud provider API-specific) solution would not have been. We’ve open sourced our first pass on the gem to do this. More details on that are forthcoming.
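The scaling decision itself can be quite small. The sketch below is a hypothetical illustration in Python of sizing a worker pool from queue depth; it is not the logic of our gem, and the function name, parameters, and thresholds are assumptions made for the example.

```python
import math


def desired_workers(queued_jobs, jobs_per_worker, min_workers, max_workers):
    """Number of workers to run for the current queue depth, within bounds."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp so the pool never drops below the floor or exceeds the ceiling.
    return max(min_workers, min(max_workers, wanted))


# Example: 120 queued jobs, each worker expected to handle about 50.
print(desired_workers(120, 50, min_workers=1, max_workers=10))
```

The result would then be used to decide how many Kubernetes Jobs to create, with the cluster autoscaler adding nodes when the new pods don’t fit.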
If we want to deploy our currently stored images to a completely new environment (say to address a security issue) it would be a matter of a few seconds to a couple of minutes. That’s pretty outstanding and one of the lauded benefits of Docker. Deploying new code, on the other hand, takes 6 to 20 minutes (depending on what code was changed). Most of that is because of how long Docker takes to build layers and upload images. Our old system took 90 seconds to deploy a code update. A more micro-service architecture might make this faster as the images could be smaller, although I’m not totally convinced of that as I feel there’s added complexity in micro-service deployments.
Our deployments now include the latest OS updates (e.g. apt-get update) with each release. This means we get security updates as often as they are released. With the old VM system it was more like as often as I thought about it. That’s a change that makes me sleep better at night.
The cost of GCE is much higher than our bare metal, and it has slower performance for intensive queries (as mentioned above). However, the improved scaling makes up for it. Not having to babysit a 12-hour database server update is worth a little extra money. Snapshot storage, and the associated speed of deploying new instances, also makes this nearly worth the increased cost.
The cost of GCE is significantly lower than our VMs were under Blue Box. In fact the savings more than make up for the increased cost of the machines migrated from bare metal to GCE. In addition, we have much better performance of individual containers versus the processes on the previous VMs.
Google paid support has not been all that helpful. At best they have helped us see things that we missed in debugging a problem. More often they’ve responded along the lines of “it’s not meant to be used that way.” Their resolution time is on the order of weeks, often with days of “I am researching” to be followed up with questions like “how do you know it’s doing that?” I’ve had no sense in any interaction that any of my questions are being considered to help improve the service.
Our Google account rep has been helpful in trying to connect us with people who have answers, but I’ve leaned on him less for support. He’s been very encouraging of us submitting our experience and story so they can improve the service (although that tends to be a one-way flow).
Some of the best support we had was from the Kubernetes open source team (via GitHub issues). Due to the nature of public issues, this does mean that I only posted when I could provide a clear indication of a problem (as opposed to when I needed help diagnosing one). They went above and beyond to investigate issues we reported (including having members at Google inspect internal GKE logs). When we started this project in August and September they appeared to be triaging bugs within an hour and responding within a day. That seems to have tapered off, and over the last few months they have been pretty non-responsive, unfortunately.