PostgreSQL Monitoring and Performance Tuning: Phase 2

This is a follow-up to the post PostgreSQL Monitoring and Performance Tuning.

To get the most out of our servers, I had been tracking PostgreSQL's use of memory and temporary disk space. It seemed that we were pushing the attached disks beyond their capabilities, so I set up a chart to track disk utilization. I was also able to increase work_mem without any deleterious effects, but going too high would run us out of memory, so I set up a chart of the percentage of memory used.

[Screenshot: disk utilization and memory usage charts]

By monitoring these while I increased work_mem, I found the point at which the queries held on disk dropped to almost nothing and disk utilization stopped being pinned at 100%.
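As an aside, PostgreSQL's own statistics are one way to feed a "queries held on disk" style chart. Here's a rough sketch in the same style as the other snippets, assuming the same $API_KEY, $HOST, $SERVER, and $PORT variables and a local superuser; note that pg_stat_database.temp_bytes is cumulative since the last stats reset, so chart its rate of change rather than the raw value:

# Sum of temp file bytes written across all databases (a cumulative counter).
TEMP_BYTES=$(psql -U postgres -At -c "SELECT COALESCE(SUM(temp_bytes), 0) FROM pg_stat_database;")
echo "$API_KEY.postgres.$HOST.temp_bytes $TEMP_BYTES $(date +%s)" | nc $SERVER $PORT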

In addition to the charts in the original article, I added two more.

Disk Utilization

The first was for disk utilization. Looking at iostat, I could see that the drive holding the data was at nearly 100% utilization all the time.

# Two 5-second samples of extended stats for sdb; keep the second (the first is the since-boot average).
IOSTAT=$(iostat 5 2 -d -x sdb | tail -2 | grep sdb)
# Field 14 is %util on this version of iostat; send it to Graphite's plaintext listener.
echo $IOSTAT | awk '{print "'$API_KEY'.postgres.'$HOST'.io_util " $14 " " strftime("%s")}' | nc $SERVER $PORT

This logs the percent utilization of device sdb to Graphite for charting. The parameters for iostat tell it to produce two summaries at 5-second intervals; the first report from iostat is the average since the system started, so we take a second sample to get the last 5 seconds' activity.
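For completeness, the snippet expects a few variables to already be set; a minimal, hypothetical setup (all values here are placeholders) might look like:

# Hypothetical values for the variables used by these snippets; adjust for your setup.
API_KEY="your-graphite-api-key"      # metric prefix (e.g. a Hosted Graphite API key)
HOST=$(hostname -s)                  # short hostname used in the metric path
SERVER="graphite.example.com"        # Graphite/carbon host
PORT=2003                            # carbon's default plaintext listener port

Run the snippet from cron every minute or so and the %util series accumulates in Graphite for charting.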

I should note that iostat doesn't provide accurate results for disk arrays (e.g., RAID) or SSDs, because they can serve requests in parallel. As Marc Brooker points out, the man page states:

Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

Memory Utilization

By increasing work_mem to 1.5G, I was able to reduce the disk utilization (whatever that really means) and, more importantly, significantly reduce the queries held on disk. This also helped speed up queries.

However, we still had queries that were running for 20 minutes or longer. While the system was mostly idle, I increased work_mem to 2G and then to 4G (it was really helpful to discover that this parameter can be changed with only a reload, without restarting the entire server). Each increase proved successively better on similar (but not cached) queries. With the 4G setting, the 20-minute queries dropped to less than 30 seconds. This is great!
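For reference, here's roughly what that reload-only change looks like (a sketch; the data directory path is a placeholder, and ALTER SYSTEM requires PostgreSQL 9.4 or later):

# Option 1: edit postgresql.conf by hand...
#   work_mem = 4GB
# ...then tell the server to re-read its configuration (no restart required):
pg_ctl reload -D /var/lib/postgresql/data    # placeholder data directory

# Option 2 (9.4+): make the change from psql and reload from there:
#   ALTER SYSTEM SET work_mem = '4GB';
#   SELECT pg_reload_conf();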

But with hundreds of connections, each potentially consuming 4G (or more, since work_mem applies per sort or hash operation), that could use up the server's memory pretty quickly. When that happens, the OOM killer can kill random queries (which affects the application) or even kill PostgreSQL itself. So I added a chart to track the amount of memory used (not including buffers or cache).

# With the older free output format, $3/$6/$7 are used/buffers/cached and $2 is total.
FREE=$(free | grep 'Mem:' | awk '{print ($3-$6-$7)/$2}')
# Ship the resulting fraction to Graphite.
echo $FREE | awk '{print "'$API_KEY'.postgres.'$HOST'.pct_used " $1 " " strftime("%s")}' | nc $SERVER $PORT

This line uses free to find the amount of used memory, subtracts the buffers and cache, then divides by the total memory. So far we’ve kept this below 20%.
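If you'd rather not rely on eyeballing the chart, a simple guard can be bolted onto the same script (a rough sketch; the 0.5 cutoff and the pg-monitor tag are arbitrary choices for illustration):

THRESHOLD=0.5    # arbitrary cutoff for illustration; we aim to stay well below this
# Compare the fraction computed above against the threshold and log a warning to syslog.
if awk -v pct="$FREE" -v max="$THRESHOLD" 'BEGIN { exit !(pct > max) }'; then
    logger -t pg-monitor "memory usage at ${FREE} of total exceeds ${THRESHOLD}"
fi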
