Performance is a Science

For 2,000 years scholars held that heavier objects fall faster than lighter ones, partly because Aristotle couldn’t be bothered to take 2 minutes to experiment. Hell, he even wrote that men have more teeth than women. Isn’t that crazy? And yet, people often rely on this kind of fact-free reasoning to arrive at conclusions about computer performance, among other things. Worse, they spend their IT budgets or sacrifice code clarity based on these flawed ideas. In reality computers are far too complex for anyone to handle performance problems by “reasoning” alone.

[Image: Galileo on Pisa. This story is not true, by the way.]

Think about a routine in a modern jitted language. Right off the bat you face hidden magic like type coercion, boxing, and unboxing. Even if you know the language intimately, unknowns are introduced as your code is optimized first by the compiler, then again by the JIT compiler. It is then fed to the CPU, where optimizations such as branch prediction, memory prefetching and caching have drastic performance implications. What’s worse, much of the above can and does change between different versions of compilers, runtimes, and processors. Your ability to predict what is going to happen is limited indeed.

To take another example, consider a user thinking of RAID-0 to boost performance. Whether there are any gains depends on a host of variables. What are the patterns of the I/O workload? Is it dominated by seeks and random operations, or is there a lot of streaming going on? Reads or writes? How does the kernel I/O scheduler play into it? How smart are the RAID controller and drivers? How will a journaling file system impact performance given the need for write barriers? What stripe sizes and file system block sizes will be used? There are way too many interdependent factors and interactions for speculative analysis. Even kernel developers are stumped by surprising and counterintuitive performance results.

Measurement is the only way to go. Without it, you’re in the speculation realm of performance tuning, the kingdom of fools and the deluded. But even measurement has its problems. Maybe you’re investigating a given algorithm by running it thousands of times in a row and timing the results. Is that really a valid test? By doing so you are measuring a special case where the caches are always hot. Do the conclusions hold in practice? Most importantly, do you know what percentage of time is spent in that algorithm in the normal use of the application? Is it even worth optimizing?
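One way to put a number on that last question is Amdahl's law: if a routine accounts for a fraction p of total run time and you make it s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). The sketch below is only an illustration. The function name and the sample figures are made up, and the value of p has to come from profiling the real workload, not from a guess.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of run time
    (0.0 to 1.0) is made `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Hypothetical figures: the routine is 5% of run time and becomes 10x faster.
    print(f"{overall_speedup(0.05, 10):.3f}")
    # Hypothetical figures: the routine is 80% of run time and only doubles in speed.
    print(f"{overall_speedup(0.80, 2):.3f}")
```

With these made-up inputs, a heroic 10x win on a routine that takes 5% of the time buys roughly a 5% overall gain, while a modest 2x win on the 80% routine buys about 67%. That is why the share of time matters more than the micro-benchmark result.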

[Image: LHC, CMS Detector]

Or say you’ve got a fancy new RAID-0 setup. You run some benchmark that writes large globs of data to the disk and see that your sustained write throughput is twice that of a single disk. Sounds great. Too bad it has no bearing on most real-world workloads, which rarely consist of one long sequential write. The problem with the naive timing test and the benchmark is that they are synthetic measurements. They are scarcely better than speculation.

To tackle performance you must make accurate measurements of real-world workloads and obtain quantitative data. Thus we as developers must be proficient with performance measurement tools. For code this usually means profiling, so you know exactly where time is being spent as your app runs. When dealing with complex applications, you may need to build your own instrumentation to collect enough data. Tools like Cachegrind, which simulates how a program interacts with the CPU caches, can help paint a fuller picture of reality.
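As a minimal sketch of what profiling looks like in practice, here is Python's standard-library cProfile wrapped around a toy workload. Everything about the workload (`parse_records`, `slow_lookup`, `workload`) and the helper `profile_report` is hypothetical and exists only to give the profiler something to find. In real use you would profile your actual application under a realistic load.

```python
import cProfile
import io
import pstats


def parse_records(n):
    """Hypothetical cheap step: build a list of even integers."""
    return [i * 2 for i in range(n)]


def slow_lookup(records):
    """Hypothetical hot spot: a linear scan of the list on every iteration."""
    hits = 0
    for value in range(len(records)):
        if value in records:  # O(n) membership test, so O(n^2) overall
            hits += 1
    return hits


def workload():
    records = parse_records(2000)
    return slow_lookup(records)


def profile_report(func, top=5):
    """Run func under cProfile; return its result and a text report
    of the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    result, report = profile_report(workload)
    print(report)
```

The report lists call counts and time per function, so the quadratic `slow_lookup` stands out without anyone having to guess where the time went.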

For website load times and networks you might use tools like Wireshark and Fiddler, as Google did for Gmail. In databases, use SQL profiling to figure out how much CPU, reading, and writing each query is consuming; these are more telling than the time a query takes to run since the query might be blocked or starved for resources, in which case elapsed time doesn’t mean much. Locks and who is blocking whom are also crucial in a database. When looking at a whole system, use your OS tools to record things such as CPU usage, disk queue length, I/Os per second, I/O completion times, swapping activity, and memory usage.
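The gap between elapsed time and resources consumed is easy to see even outside a database. The sketch below is an analogy, not a SQL profiler: it uses Python's `time.perf_counter` (wall clock) and `time.process_time` (CPU time used by the process) on two hypothetical tasks, one that merely waits and one that computes.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return wall_used, cpu_used


def blocked_task():
    """Hypothetical stand-in for a query waiting on a lock: it only sleeps."""
    time.sleep(0.2)


def busy_task():
    """Hypothetical stand-in for a query doing real computation."""
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    for task in (blocked_task, busy_task):
        wall, cpu = measure(task)
        print(f"{task.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

The sleeping task shows a long elapsed time with almost no CPU consumed, which is the signature of something that is blocked, not something that is slow.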

In sum, do what it takes to obtain good data and rely on it. I’m big on empiricism overall, but in performance it is everything. Don’t trust hearsay, don’t assume that what held in version 1 is still true for version 2, question common wisdom and blog posts like this one. We all make comical mistakes, even Aristotle did. Naturally, it takes theory and analysis to decide what to measure, how to interpret it, and how to make progress. You need real-world measurement plus reasoning. Like science.

Comments

10 Responses to “Performance is a Science”

  1. Anonymous on December 18th, 2008 1:37 pm

    Your own source, the indubitable wikipedia, says “more massive” in the description of Aristotle’s error. Mass is not equivalent to weight, as two minutes with a dictionary might tell you.

  2. Sterling Camden on December 18th, 2008 1:40 pm

    Very true indeed. I remember learning this lesson back in 1984. We were looking at upgrading the version of the language we were using, and the upgrade was a fairly radical redesign of the language’s internals. So we did a lot of benchmarking. We tested every operation we could think of, and they all came out faster in the new version. So we converted about 800 clients (all at once — another mistake) and their systems started crawling.

    We hadn’t taken into account a change they made to memory management that, in a large, real world application caused almost constant thrashing.

    That’s when I learned that you must test the performance of the entire system under real conditions. Contrived benchmarks may be good for finding out where performance breaks down, but not if.

  3. Gustavo Duarte on December 18th, 2008 3:17 pm

    @Anonymous:

    A clear distinction between mass and weight in the West came only with Newton’s Principia, about 2,000 years after Aristotle had died. Galileo himself did not have a clear concept of mass. See “Concepts of Mass in Contemporary Physics and Philosophy”.

    The word “massive” does not appear a single time in Aristotle’s Physics (eg, see http://etext.library.adelaide.edu.au/a/aristotle/physics/complete.html) or in the Heavens (eg, see http://classics.mit.edu/Aristotle/heavens.3.iii.html).

    Aristotle uses “light” and “heavy” repeatedly in his discussion, not “massive”. But given that the concept of mass is nowhere near developed in the works, your point is moot, _even_ if the word were in the books.

    @Sterling: welcome back. That’s a good story – nothing beats examples from the trenches. I liked the school bus prank :)

  4. Samuli on December 23rd, 2008 5:08 am

    Because of the complexity of any real world system, I feel that the optimizations made in the code should focus on the computational complexity of the algorithms used. There the gain and pain can be “reasoned”.

  5. Gustavo Duarte on December 23rd, 2008 9:36 am

    @Samuli: great point.

    I was focusing on hardware aspects of performance, and ignoring the mathematical aspects.

    The opposite of what I describe here is real as well: people tweaking the trees of routine-level performance and missing the forest of algorithmic complexity. It’s foolish to ignore theory.

    It takes both. Cheers.

  6. Alex Railean on December 24th, 2008 3:59 am

    I think you should also mention powertop, it can be used to measure the impact of your changes on power consumption. In fact, there are several popular cases that most people have heard of – a blinking cursor in a text editor, and an “idle time measurer” in an instant messenger.

    This is especially important for laptops, but we shouldn’t neglect this on desktops either.

    If someone can recommend a similar utility for Windows, please do so.

  7. Gustavo Duarte on December 29th, 2008 1:23 am

    @Alex: that’s a great point. I’m going to write a post on this, optimizing for power. I don’t know of a Windows powertop, added to the research queue.

  8. Software Quality Digest – 2009-02-04 | No bug left behind on February 4th, 2009 12:48 pm

    [...] Performance is a Science – Gustavo Duarte on performance factors and why measuring your code is a must [...]

  9. Jean-Marc on February 5th, 2009 5:41 pm

    This blog post is just too true, every bit of it. I know, I worked on performance tuning of rather complex systems such as large clusters of file servers, it was a lot of fun.

    One particular performance problem I investigated back then comes back to my mind: we had poor results while running a rather simple filesystem benchmark that our customer had written for acceptance tests on a big computing cluster. This was on a parallel filesystem, meaning many clients accessed files on a set of servers providing a single unified namespace. We knew this benchmark well, it was very simple, and we had previously used it for acceptance of another bigger cluster. We had good experience with the filesystem software as well: a complex piece of software, but well written, with lots of statistics and useful tuning knobs to play with. Some of the hardware in this cluster was new to us, especially the interconnect (i.e. the high speed network), so we looked into that as much as we could (which was not much, really). To make things harder, we could not reproduce the problem in our labs; it would only show up with a large number of nodes (about 100 I think).

    So where was the problem? Actually in a number of places:
    – the parallel benchmark used a rather inefficient communication scheme to gather its statistics (a really tiny amount of data, but sent all at once to a single node)
    – the communication libraries for this new type of interconnect had some parameters ill-suited for a big cluster, especially with the scheme used in the benchmark
    – the interconnect hardware reacted in very strange ways in this particular case: it would flood some nodes on the network with low-level error messages
    – this flood led to higher-level protocol errors that caused connection disruptions for the filesystem software

    Now the funny thing is, the hardware had absolutely no counter for this kind of error, it was invisible to software, so we could only have spotted this behaviour with some expensive logic analyzer (of course we never got a budget for this). So how did we solve it? Through many tests, a lot of sweat, and mail exchanges with one expert for this interconnect who eventually guessed (after asking several questions of course) what was going on. Hallelujah.

    It took us over two months of investigation to find the root cause, and the solution was immediate (set a parameter in the communication libs). But if we had had useful error counters, I bet the problem would have been solved in a week or two at most.

    So my conclusion is: statistics and error counters are vital to debugging and performance analysis. :) But too few pieces of software (or hardware) are built with field problem analysis in mind… That’s why in this job you still need a top-notch crystal ball (some call that “experience”).

  10. Charlie on December 1st, 2011 2:50 pm

    Excellent article! Reminds me of “Don’t believe what your teacher tells you just because he is your teacher.” – Buddha
    By which he meant, check things out for yourself. Do an empirical test to see if it works.
