Managing Cron Jobs in Kubernetes with Ruby

We run a Rails application as Docker containers in Kubernetes. Our application and services have a fair number of scheduled tasks, and before we moved to containers these ran under cron on the server's VM. When we moved to Docker we first migrated by deploying a container that did nothing but run our cron jobs. We've since migrated those to native Kubernetes CronJobs, which vastly improved the resiliency of our system.
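Each crontab entry maps to a CronJob manifest. A minimal sketch, with illustrative names, image, and schedule (clusters of that era used the batch/v1beta1 API group; current clusters use batch/v1):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"          # standard cron syntax
  concurrencyPolicy: Forbid      # don't start a new run if the last is still going
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: report
              image: gcr.io/example/app:latest
              command: ["bundle", "exec", "rake", "reports:nightly"]
          restartPolicy: OnFailure
```

Because each run is its own Job with its own pod, a task that crashes or runs long no longer takes the whole cron container down with it.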


The New Google Cloud Build

We recently moved several of our projects to the new Google Cloud Build for building container images and pushing them to the registry. It's a pretty simple system (not a full CI platform), but it does the job well, and I liked having the "build" part of the toolchain separate from the "run tests" part. That said, it joins the many tools that leave me writing bash scripts in YAML.
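The build itself is just a list of steps in a cloudbuild.yaml. A minimal sketch (the image name is illustrative):

```yaml
# Hypothetical cloudbuild.yaml: build the image, then push it on success.
steps:
  - name: "gcr.io/cloud-builders/docker"
    args: ["build", "-t", "gcr.io/$PROJECT_ID/app:$COMMIT_SHA", "."]
# Images listed here are pushed to the registry when all steps succeed.
images:
  - "gcr.io/$PROJECT_ID/app:$COMMIT_SHA"
```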


Disable Rails bootsnap in Production

We recently upgraded many of our services to Rails 5.2, which installs bootsnap to make start-up faster. However, bootsnap depends on caching data to the local file system, and our production containers run with read-only file systems for security. So I decided to remove bootsnap in production:

# Gemfile
#...
group :development, :test do
  gem 'bootsnap', '~> 1.3'
end
#...

# config/boot.rb
#...
require "bundler/setup" # Set up gems listed in the Gemfile.
begin
  require "bootsnap/setup" # Speed up boot time by caching expensive operations.
rescue LoadError
  # bootsnap is an optional dependency, so it's fine if it isn't installed.
  # Do not load in production because file system (where cache would be written) is read-only
  nil
end


Autoscaling Resque with Kubernetes

We run lots of background jobs and background workers. Some of these have a fairly consistent load and some vary greatly. In particular, we have a background process that can consume 30GB or more of memory and run for over a day. For smaller workloads it can complete in 15 minutes (consuming far less memory). At other times the queue can be empty for days.

Traditionally, Resque runs with one or more workers monitoring a queue and picking up jobs as they arrive (and as a worker becomes available). To support jobs that could grow to 30GB, that meant allocating 30GB per worker in our cluster. We didn't want to allocate lots of VMs to run workers that might sit idle much of the time; in fact, we didn't want to allocate any VMs at all when there were no jobs in the queue. So we came up with a solution that uses Kubernetes Jobs to run Resque jobs and scales from zero.

We've open-sourced our resque-kubernetes gem to do this.
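The idea is that a job class provides a Kubernetes Job manifest from which a worker can be launched when jobs are enqueued. A sketch of the pattern (the class name, queue, image, and manifest values are all illustrative, and the commented-out `extend Resque::Kubernetes::Job` / `job_manifest` hooks are my reading of the gem's interface, not a verbatim copy of it):

```ruby
# Hypothetical Resque job that can scale from zero via Kubernetes Jobs.
class HeavyReportJob
  # extend Resque::Kubernetes::Job  # hook provided by the resque-kubernetes gem

  def self.queue
    :heavy_reports
  end

  # Manifest for the Kubernetes Job that runs a worker for this queue.
  # The worker pod requests enough memory for the worst-case job, but the
  # pod only exists while there is work to do.
  def self.job_manifest
    {
      "metadata" => { "name" => "heavy-report-worker" },
      "spec" => {
        "template" => {
          "spec" => {
            "containers" => [{
              "name"      => "worker",
              "image"     => "gcr.io/example/app:latest",
              "command"   => ["bundle", "exec", "rake", "resque:work"],
              "env"       => [{ "name" => "QUEUE", "value" => "heavy_reports" }],
              "resources" => { "limits" => { "memory" => "30Gi" } }
            }],
            "restartPolicy" => "OnFailure"
          }
        }
      }
    }
  end

  def self.perform(report_id)
    # ... long-running work ...
  end
end
```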


PostgreSQL Monitoring and Performance Tuning: Phase 2

This is a follow-up to the post PostgreSQL Monitoring and Performance Tuning.

To get the most out of our servers I was tracking PostgreSQL's use of memory and temporary disk space. It seemed we were pushing the attached disks beyond their capabilities, so I set up a chart to track disk utilization. And while I could increase work_mem without any deleterious effects, setting it too high would run the server out of memory, so I also set up a chart of the percentage of memory used.
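The temp-file pressure can be read straight from PostgreSQL's statistics views. A sketch of the kind of queries involved (the session-level setting value is illustrative):

```sql
-- Cumulative temp files created and bytes spilled to disk, per database.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
ORDER BY temp_bytes DESC;

-- work_mem can be raised per session to test a value
-- before committing it to postgresql.conf.
SET work_mem = '64MB';
```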

By monitoring these while increasing work_mem, I found the point at which query data spilled to disk dropped to almost nothing and disk utilization fell from being pinned at 100%.


Moving to Google Cloud Platform, Docker, and Kubernetes

Over the past six months we moved all of our infrastructure from virtual machines and bare metal to Google Cloud Platform. We moved many of our systems to Docker and Kubernetes on Google Container Engine (GKE), and migrated our databases from bare metal to virtual machines in Google Compute Engine (GCE). The entire effort took about 6 weeks of work and 3 months of stabilization. In the end we have mostly better performance (databases are a bit slower), much better scaling options, and lower costs, but less stability and resiliency (we're working on it).

I’ve gathered some notes and am doing a presentation on this in the coming week. I’d like to expand on these thoughts some more and will link to future related posts as I complete them. Below are some of the more salient points of the adventure.
