PostgreSQL Server Benchmarks: Part Three — Memory
03 Mar 2011
Good day, interfriends! Welcome to part three of my series on benchmarking Estately’s new database server. In this post I’ll discuss the tools, methodology, and results of the memory benchmarks. If you’re just joining our program already in progress, you may want to go back and skim Part One, which describes the background of the project.
Here’s a constantly-updated table of contents for the full series:
- Part One: background, hardware, and methodology.
- Part Two: a detailed description of the test system’s configuration.
- Part Three: memory results.
- Part Four: CPU results.
- Part Five: disk results.
- Part Six: epilogue.
As you may recall from Part One, the server has 96GB of RAM, made up of twelve 8GB sticks running at 1066MHz. The goal of this exercise is to find the BIOS settings that yield optimal RAM performance in synthetic benchmarks. For memory testing, I used memtest86+ v4.20 running from a USB stick.
memtest is single-threaded, so it only tests one CPU’s ability to access RAM. In order to test a situation closer to what we would see in production, I turned to a tool called STREAM, which was originally designed for testing high-performance computing applications on very large machines.
Greg Smith (who you may remember from such parts as Part One) wrote a tool called stream-scaling that wraps STREAM to automate testing your system. Here’s a quote from the stream-scaling documentation:
stream-scaling automates running the STREAM memory bandwidth test on Linux systems. It detects the number of CPUs and how large each of their caches are. The program then downloads STREAM, compiles it, and runs it with an array size large enough to not fit into cache. The number of threads is varied from 1 to the total number of cores in the server, so that you can see how memory speed scales as cores involved increase.
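If you’re curious what that actually entails, here’s a rough manual sketch of the same idea. The download URL, compiler flags, and array size below are my own illustration (older stream.c releases size the arrays with -DN, newer ones with -DSTREAM_ARRAY_SIZE), so don’t take them as exactly what stream-scaling does; the whole point of the tool is that it figures this out for you.
# grab the STREAM source (URL shown is the canonical location; check for the current version)
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
# build with OpenMP and arrays far larger than the 12MB L3 cache
gcc -O3 -fopenmp -DN=60000000 stream.c -o stream
# run with more and more threads and watch how the Triad rate scales
for threads in 1 2 4 8 16; do
  OMP_NUM_THREADS=$threads ./stream | grep Triad
done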
The plan was to boot into memtest, let it run for 10 minutes, and record the results. Then I’d boot into the OS and run stream-scaling.
A Nasty Surprise
How did that actually go? Like so:
- Boot from the memtest stick.
- Immediately discover huge problem.
Here’s a little snippet from memtest’s output:
Memtest86+ v4.20
Core i7 (32nm) 2394 MHz
L1 Cache: 32K 79801 MB/s
L2 Cache: 256K 31500 MB/s
L3 Cache: 12288K 21963 MB/s
Memory : 64G 5853 MB/s
Can you spot the problem? 64GB of RAM? No. Pretty sure I’ve got 96GB in there. I counted it myself. I went into the BIOS, pulled up Memory Settings, and found it set to “Advanced ECC Mode”.
As I understand it (i.e., not that well), Advanced ECC Mode is kind of like RAID for your RAM. It basically bonds channels on the memory controllers together, which yields a wider, 128-bit bus. I don’t know why you’d want this; presumably there is a good reason. The end result, though, is that you only use two of the three memory channels on each CPU, which means that four of the twelve slots on the motherboard don’t do anything. That leaves eight active sticks, and 8 × 8GB is the 64GB memtest was reporting.
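Incidentally, if you want to double-check what’s physically installed versus what the BIOS is exposing, dmidecode will dump the DIMM inventory from the SMBIOS tables. It needs root, and the exact output varies by machine, but something like this should show one record per slot:
# SMBIOS type 17 is “Memory Device”; empty slots show up as “No Module Installed”
$ sudo dmidecode --type 17 | grep Size: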
Recovering From Said Surprise, A Plan Emerges
Long story short, an alternate setting called “Optimizer Mode” causes the memory controllers to operate independently, thus giving us access to all 12 RAM slots and generally yielding faster performance for normal use. I made this change, re-ran memtest, and was back to 96GB of RAM.
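Back in the OS, a quick sanity check confirms the kernel now sees the full complement (MemTotal will read slightly under 96GB once the kernel’s reserved regions are subtracted):
$ grep MemTotal /proc/meminfo
$ free -g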
There was another setting in the BIOS that looked interesting. Dell calls it “Node Interleaving”. Allow me to quote from the Dell™ PowerEdge™ R610 Systems Hardware Owner’s Manual:
If this field is Enabled, memory interleaving is supported if a symmetric memory configuration is installed. If Disabled, the system supports Non-Uniform Memory architecture (NUMA) (asymmetric) memory configurations.
I’m just going to go ahead and admit that I don’t understand NUMA at all. If you do, email me an explanation (address in the footer) and I’ll add it here. Some basic googling indicates that one might see higher performance with interleaving enabled, and a brief foray into the PostgreSQL mailing lists turns up more people complaining about NUMA than praising it.
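For what it’s worth, you can at least see what NUMA layout the kernel ends up with by asking numactl (assuming the numactl package is installed). My understanding is that with interleaving disabled you should see two nodes, each owning one socket’s cores and half the RAM, and with it enabled the memory is presented as a single uniform pool, but treat that as an assumption rather than gospel.
# show each NUMA node, its memory, and which CPUs belong to it
$ numactl --hardware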
Since Optimizer Mode is the only way to access all of the RAM, that left testing with interleaving both enabled and disabled. I went back to my original plan to get results from memtest and then boot into the OS and run stream-scaling like so:
$ rm -f stream && ./stream-scaling | tee results
stream-scaling automatically builds the stream binary with correct settings for the system as currently configured, so the first step is to remove the binary if one exists. I followed this procedure with interleaving disabled, then enabled interleaving and followed it again.
Initial Results
Here’s what memtest said:
- Optimizer Mode with Interleaving disabled — 6879 MB/s
- Optimizer Mode with Interleaving enabled — 8370 MB/s
I was pretty surprised by these results. I wasn’t expecting to see a 1.5GB/s jump just by enabling interleaving. Of course, as I mentioned above, memtest only gives one piece of the puzzle…
stream-scaling runs a transfer rate benchmark starting with a single core and ramping up to the total number of cores in your system… in my case, 16. Let’s see what that looks like:
This is where things start to get interesting. For one, the speeds are much, much faster than what memtest reported. That’s not necessarily surprising, since it’s a different tool with different internals. What’s more surprising is that while memtest showed a considerable speed boost with interleaving enabled, stream-scaling doesn’t agree.
A Better Test
I was also a bit concerned about the fluctuations in the data, particularly around 5 and 8 cores. I figured this was likely due to other processes on the system getting in the way of a “clean” result. The simplest way I could think of to address that was to run the benchmarks multiple times and average the results. I decided to run each benchmark twenty times, so I wrote a script called multi-stream-scaling to automate this process:
#!/usr/bin/env ruby

$stdout.sync = true

count, title = ARGV[0,2]

unless count and title
  abort "Usage: #{$0} [run count] [test title]"
end

puts "Preparing..."
system "rm -f stream"

puts "Running warmup..."
system "./stream-scaling > /dev/null"

count.to_i.times do |run|
  filename = "#{title}_run_#{run + 1}"

  print "Starting run #{run + 1} of #{count}..."
  system "./stream-scaling > #{filename}"
  puts "completed. Results written to #{filename}"
end
This script does a couple of things. First it deletes the stream binary, which forces stream-scaling to rebuild it. Next, it runs stream-scaling once to warm up the system. Then it runs stream-scaling count times in succession, storing the output of each run in a file named “<title>_run_<run number>”. To run it, just give it a number of runs and a title to use for the reports:
$ ./multi-stream-scaling 20 interleaving_disabled
In order to get consistent results, I rebooted before running multi-stream-scaling the first time. Once the runs were done, I used stream-scaling’s stream-graph.py to parse the results. It’s intended to output data suitable for plotting with gnuplot, but it was useful for my purposes as well.
I wrote another script called cruncher.rb that ran each results file through stream-graph.py and averaged the results, outputting CSV with the number of cores, average transfer rate, and standard deviation.
#!/usr/bin/env ruby

module Enumerable
  def sum
    return self.inject(0) {|a,e| a + e.to_f }
  end

  def mean
    return self.sum / self.size
  end

  def std_dev
    return Math.sqrt( self.map {|n| (n - self.mean) ** 2 }.mean )
  end
end

title = ARGV[0]

unless title
  abort "Usage: #{$0} [report_name]"
end

results = {}

Dir[ "#{title}*" ].each do |file|
  lines = IO.popen( "cat #{file} | ./stream-graph.py" ).readlines
  lines.shift # remove header comment

  lines.map {|l| l.split }.each do |cores, result|
    results[ cores.to_i ] ||= []
    results[ cores.to_i ] << result.to_f
  end
end

puts "cores,avg,stddev"

results.sort_by {|k,v| k }.each do |cores, values|
  puts [ cores, values.mean, values.std_dev ].join(",")
end
To run this, just provide the name of the report you used when you ran multi-stream-scaling:
$ ./cruncher.rb interleaving_disabled
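Since stream-graph.py is aimed at gnuplot anyway, here’s a rough sketch of turning the crunched numbers into a graph with error bars. It assumes you’ve redirected cruncher.rb’s output to a CSV file first; the filenames and styling are just illustrative.
$ ./cruncher.rb interleaving_disabled > interleaving_disabled.csv
$ gnuplot <<'EOF'
set datafile separator ","
set terminal png size 800,500
set output "interleaving_disabled.png"
set xlabel "cores"
set ylabel "average rate (MB/s)"
# strip the CSV header, then plot the average with standard-deviation error bars
plot "< tail -n +2 interleaving_disabled.csv" using 1:2:3 with yerrorlines title "interleaving disabled"
EOF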
Results Revisited
Let’s overlay the results from multi-stream-scaling on top of the results from before:
This tells me a couple of things:
- The spikiness of the data is still there, which means it’s almost certainly “real”. Eight of the sixteen cores are hyperthreaded logical cores rather than full physical CPUs, which could explain some of this.
- There are a few weird outliers (three, five, and sixteen cores), which suggests to me that there are still external factors impacting our test runs.
Overall, it’s an encouraging result. Our single-run results are nearly identical to our twenty-run results. It also confirms that there was no funny business going on before; interleaving really is a bit slower.
Curious, I did some googling and quickly came across a white paper published by Dell called Optimal BIOS Settings for High Performance Computing with PowerEdge 11G Servers that describes some testing they performed and the conclusions they reached. It’s worth noting that they’re talking about HPC as formally defined, so it’s not directly relevant to my use case, but I figured it couldn’t hurt to look it over.
Here’s a choice quote:
… node interleaving helped performance in three out of nine benchmarks by 2-4% and hurt performance on four of nine benchmarks by 4-13%. In summary, node interleave rarely helped, and helped very little when it did; therefore, node interleaving should be disabled for typical HPC workloads.
The whitepaper finds that interleaving is not helpful for “typical HPC workloads” and advises that it be left off unless application-specific benchmarking shows that it’s helpful. Between that recommendation, the results shown above, and the fact that disabled is the default, I decided to proceed with my testing with interleaving disabled.
Summary, and Next Time on bleything.net…
The tl;dr version is that you need to check two things if you just picked up a big fat Dell server:
- make sure the RAM is set to “Optimizer Mode”.
- make sure that Node Interleaving is disabled.
If you’re interested, you can look at the raw output of the tests in three gists:
- Gist 852087 — memtest and stream-scaling
- Gist 852088 — multi-stream-scaling, interleaving disabled
- Gist 852089 — multi-stream-scaling, interleaving enabled
In Part Four, I’ll be working through a similar process to find the optimal CPU settings.