CloudEndure: a review

We recently migrated our bare metal databases to Google Cloud Platform. To do that, our Google rep recommended a tool called CloudEndure. I mentioned this to some peers and figured it would be helpful to share a review of the product more broadly.
In short: it’s pretty amazing, and once it works, it works wonderfully.
CloudEndure can replicate from bare metal or cloud to cloud, for migrations or disaster recovery, by doing a block-level migration of the underlying drive. For moving data to the cloud, that meant that once the data was there I could spin up a “replica” at any time, which would come up as a rebooted server on the disk as of the last replicated update. This made testing and completing the migration of two database servers many hours less work than setting up slaves would have been.
Their support is very responsive and very helpful as well.
We didn’t have to pay for this service. Google provides it for “free” to get you into their cloud. I still had to pay for the resources used by the replicator in GCP, of course. I learned from another company that outside Google their cost is something like $115/instance/month with a minimum commitment of 10 instances. That’s pretty steep for ongoing disaster recovery.
There were some things I think they could improve:
  • The sales pipeline was absurdly long, as if they needed to keep me in the pipeline for its own sake. I talked to two different salespeople over the course of a week before they would give me access to the tool, even though I had already decided to “buy” when I first contacted them.
  • It took four days to migrate the initial 7TB of data. They said that was due to the limits of the outbound pipe at our datacenter, which may be true; I’ve never measured it. (See the back-of-the-envelope math after this list.)
  • When we were writing at more than about 20MB/s, the replicator lagged behind. Again, they said they were limited by our outbound bandwidth. Their UI wasn’t always clear about what was going on, but their support was responsive.
  • There were several quotas we had to increase in our GCP project to allow this to work. Fortunately, CloudEndure was responsive and told me what the issue was the first time, and Google increased each quota within about a minute of the request.
  • Spinning up a replica can take a couple of hours (it was 40 minutes in my first test and 2 hours in my final move). I’m sure it’s just because of all the data being moved around, but I was hoping for something faster. [As a note, when I later spun up new instances from snapshots of the migrated system in GCP, it took about 20 minutes.]
  • When I stopped replication it removed the agent from the source servers and cleaned up all the resources in GCP, but left the agent running on the replicas (where it had been migrated). I had to contact them to uninstall that manually.
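On that bandwidth point, the back-of-the-envelope math is at least consistent: 7TB in four days is roughly 7,000,000MB ÷ 345,600 seconds ≈ 20MB/s (about 160Mbit/s), which lines up with the ~20MB/s write rate at which we saw the replicator start to lag.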

A Man Hears What He Wants to Hear

“Still, a man hears what he wants to hear
And disregards the rest.” — Paul Simon, The Boxer

In the world of “statistics” (which is to say, the statement of data, as opposed to the mathematics) there is a strong affinity for complex statements that define an inconsequential fact. Though I’m not a baseball fan, I see this often in the popular culture of that sport.

Paraphrasing, “It’s the first time that a third-baseman has hit two home runs in the last four innings of a play-off game!” Or something like that.

Statements like this mean nothing. With increased specificity we feel we have more knowledge when in fact we have less.

It’s not just baseball; it’s business. The same problem underlies many financial and human-welfare decisions: not thinking about the context in which the “statistics” are stated. The growing concern over p-values is evidence of a larger issue: taking the statistical (i.e., numerical) results of an analysis and applying them outside their context.

This happens in A/B testing all the time, and A/B testing is how some companies build their product and make decisions about revenue and spending.

There are certainly selection bias and cognitive bias at work in human nature. “Question everything” is one response, but not, strictly speaking, a reasonable one. We need to accept conclusions from others if we are to work collaboratively, and to stand on the shoulders of giants.

Peer review is how the scientific community has addressed this for years. Writers have always had editors, and now customer peer reviews as well. We have this in the software world through code review; GitHub has done a great job of making pull requests the best code review tool I’ve ever used. Even so, selection bias (“I’m not qualified to review that”) and cognitive bias (“looks good to me”) persist.

This is your call to reach out to those you think might disagree with you and have them review your work. My favorite way to ensure that I have a great solution, or have reached the right conclusion, is to convince someone who is skeptical (unless power dynamics or groupthink come into play). What’s yours?


How do you fsck or change /etc/fstab when root is read-only?

I’m upgrading a server again, and we replaced the RAID array with SSDs (a 10x IOPS improvement, so worth it). But on boot the system fails to start, because the old UUID listed in /etc/fstab is no longer available.

[Screenshot: boot console reporting “Unable to resolve” for the old UUID during the filesystem check]

That “Unable to resolve” message caused the filesystem check to fail. Linux drops into a shell with the root volume loaded but in read-only mode.

To fix this I needed to change /etc/fstab to remove the offending entry. But I couldn’t write to the file system.

The obvious solution (with help from our hosting partner) is to remount the root partition as read-write:

mount -o remount,rw /

Now I can edit /etc/fstab, comment out the offending line, and exit; the system reboots into a running server.
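For reference, the whole recovery looks something like this (the UUID and mount point in the example are made up for illustration):

# remount the root filesystem read-write
mount -o remount,rw /

# list the UUIDs of the devices that are actually present
blkid

# edit /etc/fstab and comment out the stale entry, for example:
#   # UUID=0f3c9e2a-0000-0000-0000-000000000000  /data  ext4  defaults  0  2
vi /etc/fstab

# reboot into the running system
reboot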


Configuring Redis for Rails Cache (Ephemeral) and Resque (Persistence)

When we first built the search analytics app that became Keylime Toolbox we knew we wanted to use Resque for background jobs. Because that’s based on Redis, we decided to use Redis for the Rails cache. But as things grew we realized pretty quickly that these are two very different configurations.

Cached data is ephemeral. We keep it in memory so it’s easily accessible, but if the Redis instance fails it’s OK if we lose some of the data (we can always rebuild it).

Resque worker jobs, on the other hand, are not ephemeral. When we queue a job we expect it to be run and if the Redis instance crashes we want to make sure we can recover where we left off.

While we continued with Redis for both, we spun up two distinct Redis instances with different configurations.
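To give a flavor of the difference, here’s a minimal sketch of the two redis.conf files (the memory limit is illustrative, not our exact value). The cache instance is capped and evicts freely, with persistence turned off; the Resque instance never evicts and journals every write:

# cache instance: ephemeral, bounded memory, no persistence
maxmemory 256mb
maxmemory-policy allkeys-lru
save ""
appendonly no

# Resque instance: durable, never evict queued jobs
maxmemory-policy noeviction
appendonly yes
appendfsync everysec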

Breaking Up a Monolithic Rails App with an Engine and a Shared Database

When we first built the app that became Keylime Toolbox, we used Ruby on Rails because it was what the team was familiar with and it gave us lots of tools to get started quickly. Along the way, as we grew, we knew we would run into the dreaded “monolithic Rails app” problem if we didn’t separate out our code. So early on we created our own gems to segregate common, “low churn” code, and we also spun up a second Rails app as a service interface as we added more data. We learned a lot from that first service, including that data should not be segregated along a dimension, and that asynchronous, long-running tasks (batch data loading) make a great seam to split on.

Fast-forward a year or two as we brought that same asynchronous data loading to bear on a new set of data and we decided to take another stab at breaking up the app. With the previous service we had put data loading and analytics into a single service that “owned” the data source. This time around I wanted to split that up. Following the command-query separation pattern, I wanted a service that was responsible for data loading (and everything associated with that), “command” (or “mutate”), but I wanted to keep the analytics part, “query”, in the main analytics app. In order to do that I needed a database that was shared between the service and the main app.
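As a rough sketch of the shared-database piece (the names below are hypothetical, not our actual schema): in the main app, the “query” models point at a second entry in config/database.yml through an abstract base class, while the service owns the schema migrations and all of the writes.

# app/models/data_warehouse_record.rb (in the main app)
# "data_loading" is a hypothetical second entry in config/database.yml
# pointing at the database the service owns.
class DataWarehouseRecord < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :data_loading
end

# A read-only "query" model over a table the service writes.
class SearchQueryStat < DataWarehouseRecord
end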

[Diagram: Keylime command-query service model]

An SEO Research Project on CTR and Rank

I did a fun research project at the day job last week. We analyzed nearly five million Google search queries to see how click-through rates are affected by ranking, and how well averages apply across industry segments.

We determined that those whole-web or industry-wide CTR-by-rank charts that many marketers use to predict performance have little bearing on their specific site or topic.

Bottom line? We found that averages, even when segmented by query type, didn’t provide much actionable data for a specific site. When we compared the averages to site-specific data, they bore little resemblance to any individual site’s actual numbers.

However, we did find that average click-through rates within a site tended to hold fairly steady, so using actual averaged click-through rates for your own site can be very useful: for calculating the market opportunity of new content investments, estimating the impact of ranking changes, and pinpointing issues with the site’s listings on the search results page.

Why we gave up on continuous integration

Four years and three companies ago (I’ve worked with the same core team across these transitions), we ditched our continuous integration server, and we haven’t gone back. We spent too much time dealing with the impedance mismatch between the CI environment and development/production. Instead, we keep our test suite short enough (it runs in less than 2 minutes) that developers run it often, “continuously”, and, of course, with every merge and deploy.
