09 October 2010

To Branch, Or Not To Branch

In the first few 6.172 (Performance Engineering of Software Systems), the professors nailed into our heads the notion that branching was bad because of misprediction, or when the processor mispredicts the result of an if-statement. When it happens, the pipeline has to be emptied, which may be costly depending on the processor architecture (on the CSAIL cloud machines, which are powered by i7s, the processor pipeline is 14 levels deep. A branch misprediction costs approximately 16 cycles. (I'm not sure if branch misprediction cost to pipeline depth is 1:1.)) There are numerous examples to illustrate bad branching, namely quicksort, among other things. In any case, I was looking through evilwm source code at 01:30 this morning and came across the use of abs(). I thought to myself, `Gee, this looks easily exploitable with a bit hack. I wonder if the C math implementation uses branching. Well, there's only one way to find out!'

I wrote a program that would compare four different implementations of the absolute value function:
  1. inlined bithack
  2. inlined branching
  3. <math.h> implementation
  4. macro bithack
(The bithack is to find the absolute value of i is i + (i < 0) * (-2*i). Clever, huh? :)

The program was compiled with the following command:

gcc -o abs fast_abs.c -O0

and run with the following command:

pnqsub time ./abs i

where i was an integer from 0 to 3 indicating which algorithm should be tested.

Runtimes


The testing methodology was not the best; I timed the programs with the UNIX time command instead of using 6.172's ktiming framework (because I couldn't get the darn framework to cooperate with the linker). In any case I present (possibly inaccurate, but interesting) data:

AlgorithmTime (s)
15.20
24.88
32.61
42.62

At first, I was like, woah, how does inlined branching go 5% faster than inlined branching? I should probably delve into the assembly and check it out. The second thing of note is that the abs() in <math.h> does probably the same thing as macro-ed bithack.

Branch Misses


The next set of data I collected was branch miss ratios from perf, a Linux performance counter utility.

The command used to collect perf data was:

pnqsub perf stat -e branches -e branch-misses ./abs i

where i was an integer from 0 to 3 indicating which algorithm should be tested.

AlgorithmBranchesBranch Misses
130009493718268
245008992477919
310005008267316
410005114796681

All four algorithms have very little branch misses! As expected, #2 has more branches than #1, but the branch predictor is too good to incur any branch misses. Darn.

Cache Misses


The last set of data I collected was cache miss ratios (again from perf purely for my amusement.

The command used to collect perf data was:

pnqsub perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./abs i

where i was an integer from 0 to 3 indicating which algorithm should be tested.

AlgorithmLoadsMisses
1800155290439584
2800147747837913
3200083160425153
4200083986024283

All four algorithms have a very low percentage of cache misses! But oh my, #1 and #2 have a massively higher number of cache loads and more misses and #3 and #4 have very similar numbers!

Conclusion


Overall, the [possibly inaccurate] data suggest that the math library's absolute value function is more or less just the bit hack macro. The processor's branch prediction is complex enough to correctly predict the branches in #2, thus having a less of a cost than computing the bithack. The experiment also says loudly and clearly that calling user-defined functions (even when inlined) incurs a substantial overhead; maybe declaring them static would speed up the program. Finally, perhaps writing four different programs, each with its own algorithm, would make a difference.

#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include "ktiming.h"

#define FAST_ABS(i) ((i) + ((i) < 0) * (-2*(i)))
#define MAXN 10000000
#define REP 100

inline int fast_abs(int i)
{
 return i + (i < 0) * (-2*i);
}

inline int slow_abs(int i)
{
 if(i < 0)
 {
  return -i;
 }
 return i;
}

int main(int argc, char **argv)
{
 printf("argc = %d\n", argc);
 if(argc < 1)
 {
  fprintf(stderr, "Usage: ./abs i\n");
  return 1;
 }
 int cmd = atoi(argv[1]);
 int i, rep;
#ifdef KTIMING
 clockmark_t time1, time2;
#endif
 switch(cmd)
 {
  case 0: // fast abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     fast_abs(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  case 1: // slow abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     slow_abs(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  case 2: // math.h:abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     abs(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  case 3: // macro abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     FAST_ABS(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  default:
   fprintf(stderr, "This isn't supposed to happen.\n");
   break;
 }
#ifdef KTIMING
 float elapsedf = ktiming_diff_sec(&time1, &time2);
 printf("Elapsed execution time: %f sec\n", elapsedf);
#endif
 return 0;
}

No comments:

Post a Comment