
Tests for a Simulator

Andy Glew, May 2, 2000
      I've posted in the past about my dissatisfaction with
      unit tests for the sorts of simulators I work on.
      I think that I've come up with something, and I'd
      like to run it past XPers for comments.

      Brief:
      ====

      Give in. Handwritten unit tests are *not* sufficient
      for simulators. Eyeballing of full system simulation
      results is the only thing that inspires confidence.

      So: project full-simulator tests down to test patterns that
      can be applied to a unit in isolation. Automate the eyeballing
      of patterns in inherently approximate data when adding
      new functionality. When refactoring, use exact-difference
      testing.


      Detail:
      =====

      I work on microarchitectural performance simulators.
      These programs produce numerical answers that are
      inherently approximate - there is no right answer.
      Figuring out the answer is what we write the simulator
      to do.

      As I develop the simulator, adding features, the numbers
      that it calculates change - on an hourly basis
      when I am working well. Exact diff comparison to recorded
      output is too fragile.

      Moreover, this sort of simulator isn't simulating a natural
      or preexisting process. If it were, I might have recorded data
      that I could check my simulator against. Instead, the simulator
      simulates things that haven't been built. (Special case:
      I can configure the simulator to model existing hardware, and
      check it against recorded results, but that isn't a good test
      of features that haven't been built yet.)

      I need to validate the performance projections - the numbers the
      simulator calculates - not just that the program runs, doesn't crash, etc.
      It happens far too often that the simulator runs the program under
      test correctly, but produces bogus performance numbers.

      Old test strategies include:

      * Handwritten tests.

      * Microbenchmarks specifically designed to exercise particular performance
      features. Great, but very hard to write. Like, PhD-worthy.

      * Use of preexisting generic microbenchmarks like lmbench.
      Good, less labour intensive, but doesn't test new features.

      * Comparison to another simulator. Only works if you have
      another simulator.

      * Recording redundant performance data, and analyzing the
      redundant data to see if the numbers are consistent (a small
      sketch of such a check follows this list).

      * Random tests, boundary value tests, all the other good stuff.

      * Regression tests, of course.
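
      As a concrete illustration of the redundant-data item above (a minimal
      sketch; the struct, field names, and tolerance are invented here): many
      performance counters are related by definition, so a cheap automated check
      is to recompute one counter from the others and flag any disagreement.

      #include <cmath>
      #include <cstdint>
      #include <cstdio>

      // Hypothetical end-of-run counters reported by the simulator.
      struct RunStats {
          std::uint64_t instructions_retired;
          std::uint64_t cycles;
          double        reported_ipc;   // instructions per cycle, as reported
      };

      // Redundancy check: IPC is defined as instructions / cycles, so the
      // reported value must agree with the recomputed one to within a small
      // rounding tolerance. Disagreement points at a bookkeeping bug.
      bool ipc_is_consistent(const RunStats& s, double tol = 1e-6) {
          double recomputed = static_cast<double>(s.instructions_retired) /
                              static_cast<double>(s.cycles);
          return std::fabs(recomputed - s.reported_ipc) <= tol * recomputed;
      }

      int main() {
          RunStats s{1000000, 1250000, 0.80};
          if (!ipc_is_consistent(s)) {
              std::fprintf(stderr, "IPC inconsistent with counters\n");
              return 1;
          }
          return 0;
      }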

      Overall, these leave me unsatisfied. They don't make me feel
      confident that the code I am adding is correct. And confidence
      seems to be necessary to practice XP.

      So, over the past month or so, I have stepped back and tried
      to take notes on what gives me, or other computer architects,
      confidence in a simulator. And the answer is:

      Running lots of full system workloads. Recording the performance
      data. Analyzing the data, e.g. by graphing it. Looking at the graphs,
      and seeing if the shapes of the curves are as expected. Comparing
      to historical data, seeing if similar (although seldom exactly the same)
      curves were obtained. Anomalies in the curves are the first clue for bugs.
      Cross-checking with redundant measurements as above.

      There are two problems for unit testing here:
      (1) Full system workloads
      (2) Manual inspection of the results.

      First, (1) full system workloads.

      Problem (1a): time. Such simulations sometimes take days
      or weeks to run -- not exactly two-minute unit tests. There's a lot
      of material in the literature on how to sample full simulations.
      Solvable.

      Problem (1b): full system. Not unit tests.

      Here's how I have started trying to attack problem (1b). I write as
      many ordinary unit tests as I can. When that leaves me unsatisfied,
      I run full system tests and look at the curves. Then, I automate
      gathering a trace at the unit boundary - a record of all the stimulus
      applied to the unit during the full system tests. I replay this trace
      back through the unit via the test jig, and record the performance
      results for analysis. I call this "projecting" a full system test
      to a unit test.
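
      A minimal sketch of what such a trace/replay jig might look like; the
      stimulus record, the trace format, and the Unit interface here are
      assumptions for illustration, not the simulator's actual code.

      #include <fstream>
      #include <istream>
      #include <string>

      // Hypothetical stimulus record captured at the unit boundary during a
      // full-system run: one event per line, e.g. "cycle op address".
      struct Stimulus {
          unsigned long cycle;
          std::string   op;
          unsigned long address;
      };

      // Recording side: called from the full-system simulator wherever the
      // surrounding model drives the unit under test.
      void record_stimulus(std::ofstream& trace, const Stimulus& s) {
          trace << s.cycle << ' ' << s.op << ' ' << s.address << '\n';
      }

      // Replay side: the unit-test jig reads the trace back, drives the unit
      // in isolation, and dumps its performance counters for later analysis.
      template <typename Unit>
      void replay_trace(std::istream& trace, Unit& unit) {
          Stimulus s;
          while (trace >> s.cycle >> s.op >> s.address) {
              unit.apply(s);        // assumed unit-boundary entry point
          }
          unit.dump_counters();     // assumed counter dump, diffed or graphed later
      }

      The same replay routine can then drive both handwritten tests and the
      projected full-system traces.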

      This has several advantages:
      * the trace/replay is often significantly faster than the full system test;
      * the trace/replay jig is useful for handwritten tests.
      Disadvantage:
      * the traces are pretty damned big. The full system tests
      are relatively small, since they are generated by programs running on the
      simulator. The unit test vectors are bigger, since the rest of the simulator
      isn't around. (Some special case full system simulators can drive the
      unit tests without depending on recorded traces, but they don't test the
      aggressive stuff.)

      The biggest advantage is that the trace/replay jig can be used for exact
      diff comparison testing during refactoring. Exact diff testing is fragile when
      adding features to the simulator, but when I have automated the collection
      of golden reference output, and when I am religiously refactoring and not
      adding features, exact diff testing works. This has greatly increased
      my confidence while refactoring.
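
      One way to automate that exact-diff step (a sketch; the file handling and
      the idea of a per-unit golden file are assumptions): capture the unit's
      counter dump from a known-good build as a golden reference, then compare
      byte-for-byte after every refactoring step.

      #include <algorithm>
      #include <fstream>
      #include <iterator>
      #include <string>

      // Byte-for-byte comparison of the unit's counter dump against a golden
      // reference recorded before refactoring began. Any difference at all is
      // a failure; the check is only meaningful while no features are added.
      bool matches_golden(const std::string& candidate_path,
                          const std::string& golden_path) {
          std::ifstream cand(candidate_path, std::ios::binary);
          std::ifstream gold(golden_path, std::ios::binary);
          if (!cand || !gold) return false;   // a missing file counts as a mismatch
          return std::equal(std::istreambuf_iterator<char>(cand),
                            std::istreambuf_iterator<char>(),
                            std::istreambuf_iterator<char>(gold),
                            std::istreambuf_iterator<char>());
      }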

      Similarly, the "projection" of full system tests to unit tests via trace/replay
      also often allows exact diff tests to be used for a unit, even when adding
      features, since it isolates one unit from a feature in another unit that would
      cause the sequence of events to change in a real simulation. But, in general,
      exact diff comparison cannot be used when adding features.

      So, second, I need to examine (2), manual inspection of the results,
      as a barrier to automating simulator unit tests.

      It's rather hard to quantify what an experienced computer architect
      looks for when inspecting this sort of performance data. The curves
      have to "make sense".

      I went and asked the head of the university's statistics department
      about this. He told me that the field is "non-parametric regression",
      gave me a reference, but said that overall it involves AI-like pattern
      recognition, and that attempts to add AI-like pattern recognition
      to statistics all failed abysmally a decade ago. He said that human
      eyeballing is still unequalled for pattern detection, at least until
      boredom sets in.

      In a weird way, this makes me feel better. I'm falling back to writing
      simple pattern checkers, like:
      * I expect the performance to be monotonically increasing
      as cache size increases, flattening off at very large cache sizes.
      * this distribution should have one peak.
      * all points should be within +/- 10% of the old historic data.
      * the new mean should be within 5% of the old mean.
      This isn't great: sometimes the tests really need to depend more exactly
      on what you are doing. For example, when I add feature A I may
      expect that performance never decreases anywhere; when I add feature
      B I may expect that performance increases in some places and decreases
      in others. But it's a start (two of these checks are sketched below).
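
      Hedged sketches of two of those checks (the function names, tolerances,
      and data layout are mine, not part of any existing harness): a
      monotonic-increase check and a band check against historic data.

      #include <cmath>
      #include <cstddef>
      #include <vector>

      // Check: performance should be monotonically non-decreasing as cache
      // size increases (small numerical wiggles tolerated via 'slack').
      bool monotonically_increasing(const std::vector<double>& perf,
                                    double slack = 0.001) {
          for (std::size_t i = 1; i < perf.size(); ++i)
              if (perf[i] < perf[i - 1] * (1.0 - slack)) return false;
          return true;
      }

      // Check: every new point within +/- 10% of the corresponding historic
      // point, and the new mean within 5% of the old mean.
      bool within_historic_band(const std::vector<double>& now,
                                const std::vector<double>& historic,
                                double point_tol = 0.10, double mean_tol = 0.05) {
          if (now.size() != historic.size() || now.empty()) return false;
          double sum_now = 0.0, sum_old = 0.0;
          for (std::size_t i = 0; i < now.size(); ++i) {
              if (std::fabs(now[i] - historic[i]) > point_tol * std::fabs(historic[i]))
                  return false;
              sum_now += now[i];
              sum_old += historic[i];
          }
          double mean_now = sum_now / now.size();
          double mean_old = sum_old / historic.size();
          return std::fabs(mean_now - mean_old) <= mean_tol * std::fabs(mean_old);
      }
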
      Probably the next most important thing will be to automate the
      preparation of such graphs of "standard" performance curves, so
      that I can eyeball them quickly, periodically. (Currently it often
      takes days to prepare the curves, because the output formats
      change; perl-SQL is helping.)

      This goes back to my roots. Back at the Little Software House on
      the Prairie, where others worked on automating UNIX OS testing,
      I worked on benchmarking. Automated OS performance testing
      mainly involves flagging deviations rather than looking for trends
      - trend analysis is mainly used when attacking performance problems,
      rather than on a day-to-day basis - but I wrote a lot of this stuff
      way back when.

      The key thing is developing a performance suite, just like
      a test suite.

      Another useful thing is having the "knobs" of my simulators
      self-advertise, so that I can automate the creation of performance
      tests. Too many knobs are ineffective in some configurations.
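
      A rough sketch of what self-advertising knobs might look like (the
      registry design and knob names are assumptions): each parameter registers
      its name and the values worth sweeping, so a driver can enumerate the
      configuration space and generate performance runs without hand-maintained
      lists.

      #include <map>
      #include <string>
      #include <utility>
      #include <vector>

      // Hypothetical knob registry: each simulator parameter registers its
      // name and the values worth sweeping, so test generation can enumerate
      // configurations automatically.
      class KnobRegistry {
      public:
          void advertise(const std::string& name, std::vector<long> values) {
              knobs_[name] = std::move(values);
          }
          const std::map<std::string, std::vector<long>>& knobs() const {
              return knobs_;
          }
      private:
          std::map<std::string, std::vector<long>> knobs_;
      };

      // Example: a cache model advertising the sizes a performance sweep
      // should cover; a driver can then emit one run per advertised value.
      void register_cache_knobs(KnobRegistry& r) {
          r.advertise("l1_size_kb", {8, 16, 32, 64, 128});
          r.advertise("l1_assoc",   {1, 2, 4, 8});
      }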