We recently moved several of our projects to the new Google Cloud Build for building container images and pushing them to the repository. It’s a pretty simple system (not a full CI) but it does the job well, and I liked having the “build” part separate from the “run tests” part of the toolchain. That said, I feel like this is among the many tools that leave me writing bash scripts in YAML.
We recently upgraded many of our services to Rails 5.2, which installs bootsnap to make start-up faster. However, bootsnap depends on caching data to the local file system, and our production containers run with read-only file systems for security. So I decided to remove bootsnap in production:
    # Gemfile
    #...
    group :development, :test do
      gem 'bootsnap', '~> 1.3'
    end
    #...

    # config/boot.rb
    #...
    require "bundler/setup" # Set up gems listed in the Gemfile.
    begin
      require "bootsnap/setup" # Speed up boot time by caching expensive operations.
    rescue LoadError
      # bootsnap is an optional dependency, so if we don't have it it's fine
      # Do not load in production because file system (where cache would be written) is read-only
      nil
    end
Last year we changed our EC2 system from long-running instances to on-demand, spot request instances. This reduced our EC2 bill by 98%. It also ensured that every instance was built with the latest image and security patches and ran only as long as needed.
We run lots of background jobs and background workers. Some of these carry a fairly consistent load and some vary greatly. In particular, we have a background process that can consume 30GB or more of memory and run for over a day. For smaller workloads it can complete in 15 minutes (consuming a lot less memory). At other times this queue can be empty for days.
Traditionally Resque is run with one or more workers monitoring a queue and picking up jobs as they show up (and as the worker is available). To support jobs that could scale to 30GB that meant allocating 30GB per worker in our cluster. We didn’t want to allocate lots of VMs to run workers that might be idle much of the time. In fact we didn’t really want to allocate any VMs to run any workers when there were no jobs in the queue. So we came up with a solution that uses Kubernetes Jobs to run Resque jobs and scale from zero.
We’ve open sourced our resque-kubernetes gem to do this.
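To make the scale-from-zero idea concrete, here is a minimal sketch in plain Ruby. It is not the resque-kubernetes gem's API: the helper names (`workers_to_start`, `worker_job_manifest`), the image name, and the memory request are hypothetical, and it assumes a Resque worker started with `INTERVAL=0` drains its queue and then exits.

```ruby
# Hypothetical sketch of scaling Resque workers from zero with Kubernetes Jobs.
# None of these helpers come from the resque-kubernetes gem.

# How many new worker Jobs to create: nothing when the queue is empty,
# otherwise enough to cover pending jobs without exceeding max_workers.
def workers_to_start(pending_jobs:, running_workers:, max_workers:)
  return 0 if pending_jobs <= 0

  wanted = [pending_jobs, max_workers].min
  [wanted - running_workers, 0].max
end

# Build a Kubernetes Job manifest (as a plain Hash) for one worker.
# Assumption: with INTERVAL=0 the worker exits once the queue is empty,
# so the Job completes and its resources are released.
def worker_job_manifest(queue:, image:, memory:)
  {
    "apiVersion" => "batch/v1",
    "kind" => "Job",
    "metadata" => { "generateName" => "resque-worker-" },
    "spec" => {
      "template" => {
        "spec" => {
          "restartPolicy" => "OnFailure",
          "containers" => [
            {
              "name" => "worker",
              "image" => image,
              "command" => ["bundle", "exec", "rake", "resque:work"],
              "env" => [
                { "name" => "QUEUE", "value" => queue },
                { "name" => "INTERVAL", "value" => "0" }
              ],
              "resources" => { "requests" => { "memory" => memory } }
            }
          ]
        }
      }
    }
  }
end
```

For example, `workers_to_start(pending_jobs: 3, running_workers: 1, max_workers: 5)` asks for two more workers, and each one would be created from `worker_job_manifest(queue: "heavy", image: "example/app:latest", memory: "30Gi")`. With an empty queue nothing is created, so no memory is reserved for idle workers.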
This is a follow-up to the post PostgreSQL Monitoring and Performance Tuning.
In order to get the most out of our servers I was tracking PostgreSQL’s use of memory and temporary disk space. It seemed that we were pushing the attached disks beyond their capabilities, so I set up a chart to track disk utilization. While I was able to increase work_mem and not see any deleterious effects, if we went too high we would run out of memory, so I set up a chart of percent of memory used.
By monitoring these while I increased the work_mem, I found the point at which queries held on disk dropped to very little and disk utilization dropped from being pinned at 100%.
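One way to produce a "percent of memory used" figure on a Linux host is to read `/proc/meminfo`. The sketch below is an assumption about how such a metric could be computed, not necessarily how the chart above was built. The helper name `percent_memory_used` is hypothetical, and it treats `MemTotal` minus `MemAvailable` as "used".

```ruby
# Hypothetical helper: percent of memory in use, from /proc/meminfo text.
# Assumes a Linux kernel that reports MemAvailable (3.14 or later).
def percent_memory_used(meminfo_text)
  # Lines look like "MemTotal:       16000000 kB"; collect name => kB pairs.
  values = meminfo_text.scan(/^(\w+):\s+(\d+)/).to_h { |name, kb| [name, kb.to_i] }
  total = values.fetch("MemTotal")
  available = values.fetch("MemAvailable")
  ((total - available) * 100.0 / total).round(1)
end
```

On a database server this could be called as `percent_memory_used(File.read("/proc/meminfo"))` on each collection interval, then charted next to disk utilization while work_mem is raised step by step.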
Over the past six months we moved all of our infrastructure from virtual machines and bare metal to Google Cloud Platform. We moved many of our systems to Docker and Kubernetes on Google Container Engine (GKE). We migrated our databases from bare metal to virtual machines in Google Compute Engine (GCE). The entire effort took about 6 weeks of work and 3 months of stabilization. In the end we have mostly better performance (databases are a bit slower), much better scaling options, lower costs, but less stability (we’re working on it; plus resiliency).
I’ve gathered some notes and am doing a presentation on this in the coming week. I’d like to expand on these thoughts some more and will link to future related posts as I complete them. Below are some of the more salient points of the adventure.
As our PostgreSQL database needs grew, I needed to adjust how it used memory to make the most of what we are paying for. Tuning Your PostgreSQL Server was really helpful in understanding the parameters to adjust, what they affect, and their relative importance.
effective_cache_size are the parameters that I was mostly looking at to get memory use right.
In order to get a good picture and know if my changes were effective I needed to monitor how we were using memory. I set up our servers to record metrics to Graphite and configured a Grafana dashboard to show usage over time.
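As a rough illustration of recording a metric to Graphite, the sketch below uses Carbon's plaintext protocol: one `path value timestamp` line per data point, sent over TCP (port 2003 by default). The metric path, host name, and helper names are hypothetical. The post does not say which collector was actually used.

```ruby
require "socket"

# Format one data point for Carbon's plaintext protocol:
# "<metric path> <value> <unix timestamp>\n"
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end

# Send already-formatted lines to a Carbon listener.
# The host is an assumption; 2003 is Carbon's default plaintext port.
def send_to_graphite(lines, host:, port: 2003)
  TCPSocket.open(host, port) do |socket|
    lines.each { |line| socket.write(line) }
  end
end
```

A collection script might call `send_to_graphite([graphite_line("servers.db1.memory.percent_used", 75.0)], host: "graphite.example.internal")` once a minute, and a Grafana panel can then plot that series over time.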