Doesn't Exist: To Branch, Or Not To Branch

In the first few 6.172 (Performance Engineering of Software Systems), the professors nailed into our heads the notion that branching was bad because of misprediction, or when the processor mispredicts the result of an if-statement. When it happens, the pipeline has to be emptied, which may be costly depending on the processor architecture (on the CSAIL cloud machines, which are powered by i7s, the processor pipeline is 14 levels deep. A branch misprediction costs approximately 16 cycles. (I'm not sure if branch misprediction cost to pipeline depth is 1:1.)) There are numerous examples to illustrate bad branching, namely quicksort, among other things. In any case, I was looking through evilwm source code at 01:30 this morning and came across the use of abs(). I thought to myself, `Gee, this looks easily exploitable with a bit hack. I wonder if the C math implementation uses branching. Well, there's only one way to find out!'

I wrote a program that would compare four different implementations of the absolute value function:

inlined bithack
inlined branching
<math.h> implementation
macro bithack

(The bithack is to find the absolute value of i is i + (i < 0) * (-2*i). Clever, huh? :)

The program was compiled with the following command:

gcc -o abs fast_abs.c -O0

and run with the following command:

pnqsub time ./abs i

where i was an integer from 0 to 3 indicating which algorithm should be tested.

Runtimes

The testing methodology was not the best; I timed the programs with the UNIX time command instead of using 6.172's ktiming framework (because I couldn't get the darn framework to cooperate with the linker). In any case I present (possibly inaccurate, but interesting) data:

Algorithm	Time (s)
1	5.20
2	4.88
3	2.61
4	2.62

At first, I was like, woah, how does inlined branching go 5% faster than inlined branching? I should probably delve into the assembly and check it out. The second thing of note is that the abs() in <math.h> does probably the same thing as macro-ed bithack.

Branch Misses

The next set of data I collected was branch miss ratios from perf, a Linux performance counter utility.

The command used to collect perf data was:

pnqsub perf stat -e branches -e branch-misses ./abs i

where i was an integer from 0 to 3 indicating which algorithm should be tested.

Algorithm	Branches	Branch Misses
1	3000949371	8268
2	4500899247	7919
3	1000500826	7316
4	1000511479	6681

All four algorithms have very little branch misses! As expected, #2 has more branches than #1, but the branch predictor is too good to incur any branch misses. Darn.

Cache Misses

The last set of data I collected was cache miss ratios (again from perf purely for my amusement.

The command used to collect perf data was:

pnqsub perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./abs i

where i was an integer from 0 to 3 indicating which algorithm should be tested.

Algorithm	Loads	Misses
1	8001552904	39584
2	8001477478	37913
3	2000831604	25153
4	2000839860	24283

All four algorithms have a very low percentage of cache misses! But oh my, #1 and #2 have a massively higher number of cache loads and more misses and #3 and #4 have very similar numbers!

Conclusion

Overall, the [possibly inaccurate] data suggest that the math library's absolute value function is more or less just the bit hack macro. The processor's branch prediction is complex enough to correctly predict the branches in #2, thus having a less of a cost than computing the bithack. The experiment also says loudly and clearly that calling user-defined functions (even when inlined) incurs a substantial overhead; maybe declaring them static would speed up the program. Finally, perhaps writing four different programs, each with its own algorithm, would make a difference.

#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include "ktiming.h"

#define FAST_ABS(i) ((i) + ((i) < 0) * (-2*(i)))
#define MAXN 10000000
#define REP 100

inline int fast_abs(int i)
{
 return i + (i < 0) * (-2*i);
}

inline int slow_abs(int i)
{
 if(i < 0)
 {
  return -i;
 }
 return i;
}

int main(int argc, char **argv)
{
 printf("argc = %d\n", argc);
 if(argc < 1)
 {
  fprintf(stderr, "Usage: ./abs i\n");
  return 1;
 }
 int cmd = atoi(argv[1]);
 int i, rep;
#ifdef KTIMING
 clockmark_t time1, time2;
#endif
 switch(cmd)
 {
  case 0: // fast abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     fast_abs(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  case 1: // slow abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     slow_abs(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  case 2: // math.h:abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     abs(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  case 3: // macro abs
#ifdef KTIMING
   time1 = ktiming_getmark();
#endif
   for(rep = 0; rep < REP; ++rep)
   {
    for(i = -MAXN/2; i < MAXN/2; ++i)
    {
     FAST_ABS(i);
    }
   }
#ifdef KTIMING
   time2 = ktiming_getmark();
#endif
   break;
  default:
   fprintf(stderr, "This isn't supposed to happen.\n");
   break;
 }
#ifdef KTIMING
 float elapsedf = ktiming_diff_sec(&time1, &time2);
 printf("Elapsed execution time: %f sec\n", elapsedf);
#endif
 return 0;
}

Doesn't Exist

Pages

09 October 2010

To Branch, Or Not To Branch

Runtimes

Branch Misses

Cache Misses

Conclusion

No comments:

Post a Comment

Archive

Followers

About