PostgreSQL Server Benchmarks: Part Four — CPU

04 Apr 2011

CPU tuning is finicky. Modern CPUs are extremely complex, and server-class CPUs have tons of options for tuning to very specific use cases. Our Xeon E5620s have twelve options that can be toggled, and that’s before you even look at power management. To make matters worse, most of these settings make subtle changes that can only be meaningfully measured under actual application load. For that reason, this round of benchmarking will focus on power management settings.

My goal for this exercise was to establish which of the power management profiles yielded the best performance in synthetic benchmarks. The main question I wanted to answer was whether BIOS settings could make the CPUs dramatically faster or slower. The result will be a baseline configuration suitable for disk and application benchmarking. Once we get into application benchmarking, the subtle changes mentioned above will become more relevant, and will be tested where appropriate.

To prepare, I used the integrated systems management tools to reset the BIOS to defaults (as described in Part Two) and then set the memory to Optimizer Mode, following the findings from Part Three.

Power Management Settings

The BIOS supports three power management profiles:

Active Power Controller allows the BIOS to manage power settings for the CPU and memory. OS Control is similar, but the operating system is in charge. Maximum Performance turns everything up to 11.

Last time, I mentioned a whitepaper that Dell published called Optimal BIOS Settings for High Performance Computing with PowerEdge 11G Servers. Glossing over most of the details, they conclude that the Active Power Controller mode is preferable for most loads, but Maximum Performance is useful for … well, maximum performance. Their testing appears to indicate that OS Control is not a good option, but I tested it anyway.

Testing Methodology

As mentioned in Part One, I’ll be using sysbench’s CPU benchmark tools to test CPU performance. sysbench’s CPU mode calculates prime numbers up to a specified value, and can use as many threads as you like. Internally, sysbench uses the notion of “requests” to specify how much work should be done. To figure out what this meant, I had to dig into the code. Long story short, it appears that a “request” in the CPU benchmark means a single instance of calculating the primes up to the given value.
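
To make the notion of a “request” concrete, here is a minimal Ruby sketch of the work a single request appears to do: trial-divide every candidate from 3 up to the limit and count the ones that have no divisor. The method name cpu_request is my own, and this illustrates the idea as I understand it rather than reproducing sysbench’s actual code.

```ruby
# A sketch of one CPU "request": count the primes from 3 up to max_prime
# using naive trial division. The name cpu_request is hypothetical.
def cpu_request(max_prime)
  count = 0
  3.upto(max_prime) do |candidate|
    limit   = Math.sqrt(candidate)
    divisor = 2
    # Walk the possible divisors until one divides evenly or we pass sqrt(candidate).
    divisor += 1 while divisor <= limit && candidate % divisor != 0
    # No divisor found below the limit, so the candidate is prime.
    count += 1 if divisor > limit
  end
  count
end

puts cpu_request(100)  # prints 24 (the primes from 3 through 97)
```

The work is deliberately wasteful, which is the point: raising the limit makes every request more expensive, so the benchmark runs long enough to measure.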

By default, sysbench uses a single thread to run 10,000 requests, each one consisting of calculating the primes between 3 and 10,000. These values are tunable, and tune them I did. With 16 cores running, the default test finishes in about one second, so I cranked the --cpu-max-prime parameter up to 50,000.

I also decided to run the test 32 times, incrementing the thread count each time until it was running two threads per (virtual) core. Finally, I ran each test with the default of 10,000 events, and again with 10,000 * num_threads events. Scaling the event count this way means each thread gets (more or less) 10,000 events. It may or may not be interesting; we’ll see in the results.
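
The difference between the two kinds of run is easier to see written out. The sketch below builds the full plan of 32 thread counts and shows how many events each run gets in total and per thread. The names run_plan, DEFAULT_EVENTS, and MAX_THREADS are mine, chosen for illustration.

```ruby
DEFAULT_EVENTS = 10_000  # the default request count described above
MAX_THREADS    = 32      # two threads per virtual core on this machine

# Build one entry per thread count, describing both the normal and scaled run.
def run_plan(max_threads = MAX_THREADS, base_events = DEFAULT_EVENTS)
  (1..max_threads).map do |threads|
    {
      threads:           threads,
      normal_events:     base_events,                 # fixed total, split across threads
      scaled_events:     base_events * threads,       # total grows with the thread count
      normal_per_thread: base_events.to_f / threads,  # shrinks as threads are added
      scaled_per_thread: base_events.to_f             # stays constant
    }
  end
end

last = run_plan.last
puts "#{last[:threads]} threads: normal=#{last[:normal_events]} " \
     "(#{last[:normal_per_thread]} per thread), scaled=#{last[:scaled_events]}"
```

At 32 threads the normal run gives each thread only about 312 events, while the scaled run keeps every thread at 10,000.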

sysbench reports its results like so:

$ sysbench --test=cpu --cpu-max-prime=50000 --num-threads=32 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 32

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 50000


Test execution summary:
    total time:                          9.2363s
    total number of events:              10000
    total time taken by event execution: 294.4824
    per-request statistics:
         min:                                 14.18ms
         avg:                                 29.45ms
         max:                                162.76ms
         approx.  95 percentile:              44.37ms

Threads fairness:
    events (avg/stddev):           312.5000/3.80
    execution time (avg/stddev):   9.2026/0.02
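
Several of these figures can be derived from one another, which is a handy sanity check when reading a report. The sketch below recomputes a few of them from the sample output above. The variable names are mine, and the events-per-second figure is something I am deriving for illustration, not a value sysbench prints here.

```ruby
# Figures copied from the sample sysbench report above.
threads          = 32
total_time_s     = 9.2363
total_events     = 10_000
event_exec_sum_s = 294.4824  # "total time taken by event execution"

avg_request_ms    = event_exec_sum_s / total_events * 1000  # per-request avg
events_per_thread = total_events.to_f / threads             # fairness: events avg
exec_per_thread_s = event_exec_sum_s / threads              # fairness: execution time avg
events_per_second = total_events / total_time_s             # derived throughput

printf("avg per request:       %.2f ms\n", avg_request_ms)
printf("events per thread:     %.4f\n", events_per_thread)
printf("exec time per thread:  %.4f s\n", exec_per_thread_s)
printf("events per second:     %.1f\n", events_per_second)
```

The first three values line up with the 29.45ms average, the 312.5000 events, and the 9.2026 execution time in the report.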

At this point in the process, I wasn’t exactly sure which values were going to be most interesting, so I captured all of them. To automate the test runs and data gathering, I wrote this script:

#!/usr/bin/env ruby
$stdout.sync = true

unless title = ARGV[0]
  abort "Usage: #{$0} [test title]"
end

# One CSV per run type; both share the same column layout.
CSV_HEADER = "threads,total_time,num_events,event_execution_time,per_req_min,per_req_avg,per_req_max,per_req_perc,events_avg,events_dev,exec_time,exec_dev"
@normal_report = File.open("#{title}_normal.csv", 'w')
@scaled_report = File.open("#{title}_scaled.csv", 'w')
@normal_report.puts CSV_HEADER
@scaled_report.puts CSV_HEADER

def extract_values( report )
  results = []

  results << report.match( /Number of threads: (\d+)/ )[1]
  results << report.match( /total time: {26}(\d+\.\d+)/ )[1]
  results << report.match( /total number of events: {14}(\d+)/ )[1]
  results << report.match( /total time taken by event execution: (\d+\.\d+)/ )[1]
  results << report.match( /min: +(\d+\.\d+)ms/ )[1]
  results << report.match( /avg: +(\d+\.\d+)ms/ )[1]
  results << report.match( /max: +(\d+\.\d+)ms/ )[1]
  results << report.match( /percentile: +(\d+\.\d+)ms/ )[1]
  results << report.match( /events \(avg\/stddev\): +(\d+\.\d+)\/(\d+\.\d+)/ )[1,2]
  results << report.match( /execution time \(avg\/stddev\): +(\d+\.\d+)\/(\d+\.\d+)/ )[1,2]

  return results.flatten.join(',')
end

1.upto(32) do |threads|
  cmd =  []
  cmd << "sysbench"
  cmd << "--test=cpu"
  cmd << "--cpu-max-prime=50000"
  cmd << "--num-threads=#{threads}"

  normal_cmd = cmd.join(' ') + ' run'
  scaled_cmd = cmd.join(' ') + " --max-requests=#{threads * 10_000} run"

  puts "Running `#{normal_cmd}`..."
  IO.popen( normal_cmd ) {|out| @normal_report.puts extract_values(out.read) }
  puts "Running `#{scaled_cmd}`..."
  IO.popen( scaled_cmd ) {|out| @scaled_report.puts extract_values(out.read) }
end

All told, a full test suite takes about 90 minutes to run, and produces two CSV files: one for the “normal” results, and one where the requests are scaled to match the thread count.
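
One weakness of extract_values is that it assumes every pattern matches and that the column padding never changes; a failed run would die with an unhelpful error on nil. Here is a slightly more defensive sketch of the same parsing step. The names FIELDS, parse_report, and to_csv_row are my own, and I have only checked it against the sysbench 0.4.12 output format shown above.

```ruby
# One pattern per CSV column, in output order. Whitespace is matched
# loosely so a change in column padding does not break parsing.
FIELDS = {
  threads:              /Number of threads: (\d+)/,
  total_time:           /total time: +(\d+\.\d+)s/,
  num_events:           /total number of events: +(\d+)/,
  event_execution_time: /total time taken by event execution: +(\d+\.\d+)/,
  per_req_min:          /min: +(\d+\.\d+)ms/,
  per_req_avg:          /avg: +(\d+\.\d+)ms/,
  per_req_max:          /max: +(\d+\.\d+)ms/,
  per_req_perc:         /percentile: +(\d+\.\d+)ms/,
  events_avg:           /events \(avg\/stddev\): +(\d+\.\d+)\/\d+\.\d+/,
  events_dev:           /events \(avg\/stddev\): +\d+\.\d+\/(\d+\.\d+)/,
  exec_time:            /execution time \(avg\/stddev\): +(\d+\.\d+)\/\d+\.\d+/,
  exec_dev:             /execution time \(avg\/stddev\): +\d+\.\d+\/(\d+\.\d+)/
}.freeze

# Turn a sysbench report into a hash of field name => captured string,
# raising a descriptive error if any field is missing.
def parse_report(report)
  FIELDS.each_with_object({}) do |(name, pattern), row|
    match = report.match(pattern)
    raise ArgumentError, "could not find #{name} in sysbench output" unless match
    row[name] = match[1]
  end
end

# Hashes keep insertion order, so the values come out in column order.
def to_csv_row(report)
  parse_report(report).values.join(',')
end
```

Swapping this in would just mean calling to_csv_row(out.read) where the script currently calls extract_values(out.read).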

I tested the three power management profiles listed above. In order to switch between them, I used the Dell OpenManage tools I installed back in Part Two:

# omconfig chassis pwrmanagement config=profile profile=apc
# reboot
$ sysbench.rb apc

# omconfig chassis pwrmanagement config=profile profile=maxperformance
# reboot
$ sysbench.rb maxperformance

# omconfig chassis pwrmanagement config=profile profile=osctrl
# reboot
$ sysbench.rb osctrl

Each profile was set, the system rebooted, and the test script run.

Results

Let’s start by charting the simplest metric, time taken. Lower is better:

[Chart: Normal Results]

[Chart: Scaled Results]

Unfortunately, neither of these is terribly easy to read, in part due to the scaling of the chart. Let’s look again at a narrower window, from 14 to 18 threads:

[Chart: Normal Results (14-18 threads)]

[Chart: Scaled Results (14-18 threads)]

Hmmm… still not very useful. The only thing we can really get out of this is that it appears that OS Control is regularly slower than the other two options, and Max Performance seems to be just that. But the values are so close that it’s really difficult to be definitive about it.
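
When charts are this hard to read, putting a number on the gap can help. The sketch below computes, for each thread count, how much slower one profile’s total time is than another’s, as a percentage. The method names total_times and percent_slower are mine, and the file names in the usage comment assume the naming the script above produces for the titles apc, maxperformance, and osctrl.

```ruby
# Parse one of the script's CSV files (given as a string) into
# { thread_count => total_time_in_seconds }. The files contain no quoted
# fields, so a plain split is enough here.
def total_times(csv_text)
  lines    = csv_text.lines.map(&:strip).reject(&:empty?)
  header   = lines.shift.split(',')
  thr_idx  = header.index('threads')
  time_idx = header.index('total_time')
  lines.each_with_object({}) do |line, times|
    cols = line.split(',')
    times[cols[thr_idx].to_i] = cols[time_idx].to_f
  end
end

# Percent difference of `other` relative to `baseline`, per thread count.
# Positive means `other` took longer.
def percent_slower(baseline, other)
  baseline.each_with_object({}) do |(threads, base_time), diffs|
    next unless other.key?(threads)
    diffs[threads] = (other[threads] - base_time) / base_time * 100.0
  end
end

# Example usage, assuming the three runs produced these files:
#   max = total_times(File.read('maxperformance_normal.csv'))
#   os  = total_times(File.read('osctrl_normal.csv'))
#   percent_slower(max, os).each { |t, pct| printf("%2d threads: %+.2f%%\n", t, pct) }
```

A single run per profile still cannot separate a real difference from noise, so treat any small percentage that comes out of this with suspicion.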

Let’s look at another set of data: the average time per request.

[Chart: Normal Results]

[Chart: Scaled Results]

Unfortunately, the results are still not telling us very much. For fun, here are charts from the same data slice as before, 14-18 threads:

[Chart: Normal Results (14-18 threads)]

[Chart: Scaled Results (14-18 threads)]

Yeah. Same results as before.

Summary, and Next Time on bleything.net…

Unfortunately, the power management results were pretty inconclusive. Frankly, I sort of expected this from the beginning, but didn’t want to make assumptions. The reality is that synthetic CPU benchmarks are largely useless for anything other than establishing a baseline for testing other settings.

That is, no contrived tests will ever indicate what kind of performance we can expect out of our (very real-world) application. All we can do is say “yep, in this configuration we scored X, and by tweaking it we scored Y”. While that’s certainly useful, it’s a bit premature at this point. For that reason, I’ve decided to leave the system on Active Power Controller, as for the most part it is only slightly “slower” than Maximum Performance, and Dell recommends using it.

The raw results are contained in six CSV files, which you can get from Gist 893071.

In Part Five, I’ll be testing the disk subsystem to try to find optimal starting points going into our application benchmarking phase. Stay tuned!
