PostgreSQL Server Benchmarks: Part Five — Disks

05 Jul 2011

Over the past few posts I’ve talked a lot about optimizing our new server’s basic systems. The RAM and CPU benchmarking and optimization have been relatively straightforward; the individual subsystems are either fast or they’re not. Our next stop is the disks, and the landscape isn’t quite so simple this time. While there aren’t as many options for the disks as there were for the CPUs, the changes made here have a more substantial impact on the system’s performance. The main bottleneck for any database is I/O, so the goal here is to make sure we’re not leaving any performance on the table.

In this exercise, I’m trying to answer two specific questions about the disk subsystem:

1. Is the 6 Gbps H800 sufficiently faster than the 3 Gbps SAS 6/iR to justify putting the WAL drives in the external array?
2. Is hardware RAID faster than software RAID for the cluster data partition?

There’s a detailed recap of the relevant hardware below, but you might want to go back to Part Two for more details. Either way, here are links to the entire series:

Disk Hardware and Configuration

Let’s have a little refresher on the disks in the server. First, there are two controllers and one external enclosure in play. They are:

- the SAS 6/iR (internal controller)
- the H800 (external RAID controller)
- the MD1220 (external drive enclosure)

The SAS 6/iR is the controller for the drive bays in the server chassis itself. It’s a bit older and slower, running at 3 Gbps. The H800 is Dell’s current external RAID controller; it’s a more modern card running at 6 Gbps and sporting a 512MB battery-backed write cache. Connected to the H800 is Dell’s MD1220, an external JBOD with 24 2.5” drive slots.

The system drives are a pair of Western Digital WD1460BKFG 146GB 10k RPM SAS drives arranged in a RAID-1 and connected to the SAS 6/iR. They will not be tested in this series as all of the database disk access will happen on other drives.

We planned on placing the WAL storage on separate drives from the rest of the cluster, so we got 2x Fujitsu MBE2073RC 73GB 15k RPM SAS drives to be arranged in a RAID-1. The rest of the cluster data will live on 9x Seagate ST9146852SS 146GB 15k RPM SAS drives to be arranged in a RAID-10 (with 1 hot spare).

To start with, one of the WAL drives and all of the cluster drives were in the external array. The other WAL drive was in the chassis, to facilitate testing the two controllers. Speaking of which…

WAL Partition Controller Tests

Question number one: is the 6 Gbps H800 sufficiently faster than the 3 Gbps SAS 6/iR to justify putting the WAL drives in the external array? To test this, I want to measure the sequential write rates of drives in both configurations. Why just sequential writes? Turns out that’s all the WAL storage ever does… it just writes in a straight line forever and ever.

This was a simple test: format the drives with ext4 and run bonnie++. I started by partitioning the drives with fdisk, doing it by hand once and transcribing my input:

d
1
n
p
1


w

This deletes partition #1, creates a new primary partition #1 accepting the defaults for start and size (the two blank lines in the middle), and then writes out the partition table. fdisk will implicitly set the partition type to 83 (Linux) when you write out the table. Put that in a file and you can pipe it to fdisk:

# fdisk /dev/sdi < wal.fdisk
# fdisk /dev/sdj < wal.fdisk

The next step was to format and mount the drives:

# mkfs.ext4 /dev/sdi1
# mkfs.ext4 /dev/sdj1
# mkdir /mnt/3gbps
# mkdir /mnt/6gbps
# mount /dev/sdi1 /mnt/3gbps
# mount /dev/sdj1 /mnt/6gbps
# chmod 777 /mnt/[36]gbps

Then, run bonnie++:

$ bonnie++ -n 0 -f -b -c 4 -r 31744 -d /mnt/3gbps | tee ~/3gbps.bonnie
$ bonnie++ -n 0 -f -b -c 4 -r 31744 -d /mnt/6gbps | tee ~/6gbps.bonnie

Let me take a second to describe the options here:

- -n 0 skips the file creation tests
- -f skips the per-character I/O tests
- -b disables write buffering, so there’s an fsync() after every write
- -c 4 sets the concurrency level to four
- -r 31744 tells bonnie++ how much RAM (in megabytes) to assume
- -d points bonnie++ at the directory to test in

The most interesting one is -r. By default, bonnie++ looks at how much RAM you have, doubles it, and writes that much data to disk. I haven’t seen any documentation directly stating this, but my guess is that this is an effort to bust the operating system’s caches. Normally you’d want this, as read tests when the data is in a memory cache are useless.

The problem for me is that my drives are only 64GB once formatted. Fortunately, I don’t really care about busting the OS cache because I’m only concerned with sequential write rates. That means I can tell bonnie++ to use less disk without worrying too much about the cache screwing up my results; with -r 31744, bonnie++ writes about 62GB, which is where the 62G size in the output below comes from.
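If you're sizing -r for a small partition yourself, a quick sanity check is easy to script. This is just a sketch of the general approach rather than something from the actual run, and the 10% headroom is an arbitrary margin of my own:

# pick a bonnie++ -r value such that the roughly 2x-RAM write still fits the partition
AVAIL_MB=$(df -m /mnt/3gbps | awk 'NR==2 { print $4 }')   # free space on the mount, in MB
echo "try bonnie++ -r $(( AVAIL_MB * 9 / 10 / 2 ))"       # keep ~10% headroom, halve for the 2x

Anyway, let’s take a look at the SAS 6/iR: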

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
db01            62G           68781   8 125098   9           5352333  99 948.2   6
Latency                        1646ms    5768ms               195us     100ms

And the H800:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
db01            62G           92822  10 127608   9           5275436  99 825.0 103
Latency                        1750ms     493ms               194us    1105us

The main number we’re looking at is the Sequential Output Block rate. For the SAS 6/iR, it’s 67.1 MB/s. The H800 manages 90.6 MB/s, about a 35% increase. That’s enough for me; the WAL disks are going in the external array. I moved the disks into the array, created a RAID-1 volume on the H800, and re-ran bonnie++ to make sure nothing changed:

$ bonnie++ -n 0 -f -b -c 4 -r 31744 -d /mnt/wal
[output snipped]
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
db01            62G           81065  10 126596  10           5459792  99 826.7 103
Latency                        1947ms     490ms               196us    1512us

Excellent. A bit slower than the single-drive test, but I’ve noticed as much as 8 MB/s variance across multiple tests so I’m not too worried. Still faster than the SAS 6/iR. That done, time to move on.

Cluster Partition RAID Tests

The next big question is whether software RAID would be faster than hardware RAID for the cluster data partition. This meant putting the eight data drives into a software RAID-10, running some benchmarks, then putting them into a RAID-10 on the H800 RAID controller and running the same benchmarks.

For both configurations, the resulting virtual disks were formatted with stock ext4 and mounted at /mnt/data. The hardware RAID-10 was built using the RAID controller’s BIOS, which is naturally difficult to show you here. If you’re playing along at home, it’s pretty obvious how to do it. The software RAID, on the other hand, was done on the command line.

As best I can tell, the H800 doesn’t have a JBOD mode, so I exported each drive as a single-drive RAID-0 array. This effectively means that the controller would be handling the write caching but everything else would be the OS’s responsibility.

Creating the array was pretty easy, but took a few steps. First, I created the device with mknod:

# mknod /dev/md0 b 9 0

This creates a block device (b parameter) with major number 9 (md devices) and minor number 0 (indicating the first device, hence /dev/md0).
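If you want to double-check the result, the major and minor numbers show up in ls -l output where the file size normally would:

ls -l /dev/md0             # look for "9, 0" in place of a file size
grep -w md /proc/devices   # confirms that block major 9 belongs to md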

Next, I partitioned each of the drives (sda through sdh) by piping a script to fdisk as described above. The script was a little different:

n
p
1


t
fd
w

The main difference from above is that this sets the partition type to fd, which is the Linux software RAID type. fdisk was run across the drives with a for loop:

$ for i in a b c d e f g h; do sudo fdisk /dev/sd$i < fdisk-md.script; done

Next up was to actually create the array. This was done using mdadm:

$ sudo mdadm --create /dev/md0 -v --raid-devices=8 --level=raid10 /dev/sd[a-h]1

This got the ball rolling, but before testing we have to wait for the array’s initial resync to finish. That means checking /proc/mdstat periodically to see what’s up:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      570917376 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]
      [==>..................]  resync = 12.5% (71771392/570917376) finish=41.5min speed=200000K/sec

unused devices: <none>

45 minutes later, I returned and the array was ready:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      570917376 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]

unused devices: <none>
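As an aside, if you’d rather not keep poking at /proc/mdstat by hand, mdadm can block until the resync finishes. A minimal sketch:

sudo mdadm --wait /dev/md0   # returns once any resync/recovery on md0 is done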

Finally, format it with stock ext4 as before and mount it:

# mkfs.ext4 /dev/md0
# mkdir /mnt/data
# mount /dev/md0 /mnt/data
# chmod 777 /mnt/data
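One side note that isn’t strictly part of the benchmarking: if you want the array to reassemble under the same name after a reboot, it’s worth recording its definition. A quick sketch, assuming your distro reads /etc/mdadm.conf (some use /etc/mdadm/mdadm.conf instead):

sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf   # appends an ARRAY line describing md0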

Next up is to run some tests. Following the advice in the PostgreSQL 9.0 High Performance book, I decided to run three sets of tests: seek and fsync rates using sysbench, and a more general disk benchmark using bonnie++.

Back in Part Three I noticed that it was difficult to get consistent results across multiple test runs, and in response I took to running each test multiple times and averaging the results. I’ve continued this strategy here; each test was run five times and the results averaged. Before each test, the operating system disk caches were flushed by writing 3 to /proc/sys/vm/drop_caches. To facilitate all of this, I wrote a script:

#!/bin/sh

LABEL=software-raid
COUNT=5

########################################################################

echo "preparing sysbench seeks"
cd /mnt/data
rm *
sysbench --test=fileio --file-num=500 --file-total-size=500G prepare > /dev/null

echo "running sysbench seeks"
i=1
while [ $i -le $COUNT ]; do
    sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
    sysbench --test=fileio --num-threads=1 --file-num=500 --file-total-size=500G \
             --file-block-size=8K --file-test-mode=rndrd --file-fsync-freq=0     \
             --file-fsync-end=no run --max-time=60 > /home/ben/results/${LABEL}.sysbench-seek.$i
    echo $i
    i=$((i + 1))
done

sysbench --test=fileio --file-num=500 cleanup

########################################################################

echo "fsync rate"
i=1
while [ $i -le $COUNT ]; do
    sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
    sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 \
             --file-test-mode=rndwr run > /home/ben/results/${LABEL}.sysbench-fsync.$i
    echo $i
    i=$((i + 1))
done

########################################################################

echo "bonnie++"
i=1
while [ $i -le $COUNT ]; do
    sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
    bonnie++ -b -c 4 -d /mnt/data > /home/ben/results/${LABEL}.bonnie.$i
    echo $i
    i=$((i + 1))
done

Some notes about these tests. First, the sysbench seeks test requires a setup and cleanup phase, where a specified number of files are created for the actual benchmark to seek across. In this case, I created 500 files for a total of 500GB, so each file was 1GB. For bonnie++, you’ll see that most of the option flags from the WAL tests are missing; I wanted a more comprehensive set of numbers for these tests.

Raw results are available in two flavors, software RAID and hardware RAID. Let’s pull out the relevant info:

            fsync (per sec)   sysbench Seeks   bonnie++ Seeks   Seq. Write (K/sec)   Seq. Read (K/sec)
software            8118.38           186.61           412.74              62163.4              320465
hardware            6861.67           209.09           368.46              67089.4            615134.8

From this table alone it looks like a relatively even match except for fsync/sec and sequential reads. I computed the standard deviations for those two metrics and got some surprising results. The standard deviation of the software fsync/sec was 90.76, while the hardware’s was 1871.27, indicating that something went crazy there. The actual values for hardware were 3170.57, 7606.69, 8379.83, 7627.63, and 7523.14. It seems the first run coincided with some other operation that dragged performance down. Removing that value results in an average of 7784.32 (and a standard deviation of 346.03), which is quite a bit closer to the software number.
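If you want to reproduce that kind of check, a population mean and standard deviation is a quick awk one-liner. This sketch assumes you’ve already pulled the per-run fsync rates out of the result files into a plain list, one number per line (the filename is made up):

# population mean and standard deviation, one value per input line
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; printf "mean=%.2f  stddev=%.2f\n", m, sqrt(ss / n - m * m) }' hardware-fsync.txt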

The most surprising thing was the sequential read test, where the hardware controller appears to be nearly twice as fast as the software RAID. The values back that up:

            Test 1   Test 2   Test 3   Test 4   Test 5
software    337376   321752   284567   324414   334216
hardware    655455   617670   557031   659125   586393

I don’t have a theory as to why the hardware RAID is so much faster at this. My best guess is that, being specialized hardware, it’s more heavily optimized for this than the software RAID stack. Regardless of the reason, though, nearly double the sequential read rate more than outweighs the small differences in the other stats, so it looks like we’ve got a winner!

Summary, and Next Time on bleything.net…

Back up at the top of the post, I posed two questions:

1. Is the 6 Gbps H800 sufficiently faster than the 3 Gbps SAS 6/iR to justify putting the WAL drives in the external array?
2. Is hardware RAID faster than software RAID for the cluster data partition?

The answers are “yes” and “enhhhhh”. It shouldn’t be surprising that a 3 Gbps controller would be slower than a 6 Gbps controller, so that one was kind of a gimme. The hardware vs. software RAID question is a little more interesting, though. The conclusion I’ve drawn is that RAID controllers are not inherently superior to software RAID; but we have one in our server, and we’re seeing better numbers from it, so I’m going to stick with it. If you’re speccing a new server, make sure you have the opportunity to run these kinds of tests, and send the controller back if it’s not worth it.

In the next post, I’ll be walking through optimizing the performance of the PostgreSQL database itself, as well as tweaking filesystems a bit. Should be a good time, keep your eyes peeled!
