Branch target prediction and performance
I was recently analyzing the performance of DIRACBenchmark after having seen a peculiar speed-up in the transition from the Sandy Bridge/Ivy Bridge architecture to Haswell/Broadwell. The benchmark essentially generates a long sequence of random numbers via the Python random module and performs some multiply and accumulate operations on them. What stood out after some initial tests was a 50% improvement in runtime when switching from Xeon v2 to Xeon v3 processors. All the other benchmarks I was familiar with showed a difference in the 10%-20% range, but this Python script was definitely an outlier.
Systems under test
I was comparing the performance on the following two systems:
| | Ivy Bridge | Haswell |
| --- | --- | --- |
| CPU | Dual socket E5-2650v2 @ 2.60 GHz, 8 physical cores, 16 threads, 2.8 GHz Turbo | Dual socket E5-2640v3 @ 2.60 GHz, 8 physical cores, 16 threads, 3.4 GHz Turbo |
| RAM | 64 GiB DDR3, fully balanced NUMA domains | 128 GiB DDR4, fully balanced NUMA domains |
| OS | CentOS 7.3 running kernel 3.10.0-514 | CentOS 7.3 running kernel 3.10.0-514 |
| gcc | 4.8.5 | 4.8.5 |
| CPU clk | 2.00 GHz, acpi_cpufreq userspace governor | 2.00 GHz, acpi_cpufreq userspace governor |
The clock frequency was set to 2.00 GHz via the userspace governor of the acpi_cpufreq driver. Disregarding for a moment the output of the benchmark, which is not relevant here, the runtime on the two platforms was the following:

The runtime highlights a speedup close to 50% on Haswell.
Instruction set
To make sure the Python interpreter was making use of the same instruction set on both architectures, I traced the benchmark with the Intel Software Development Emulator (SDE). SDE is a dynamic binary instrumentation tool capable of tracing the execution down to the assembly instruction level, providing emulation support on Intel platforms for instruction sets that are not natively supported by the hardware (e.g. it is possible to test AVX512 code on a machine which provides only AVX2). Intel SDE is based on the Pin library, which uses the ptrace system call on Linux to inject instrumentation code into the application. Each instruction of the application is replaced with a jmp to the instrumentation, allowing it to gain control over every single instruction executed. Clearly, this has a bearing on the runtime.
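For reference, a per-function instruction profile like the ones discussed below can be obtained by running the application under SDE with the -mix knob (the script name here is illustrative):

```
# Collect an instruction-mix histogram for the benchmark;
# SDE writes it to sde-mix-out.txt by default.
$ sde64 -mix -- python DIRACBenchmark.py
```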
The first result obtained with SDE was a profile of the functions executed by the benchmark. The Ivy Bridge function breakdown, limited to the first 10 contributors, was the following:
The Haswell function breakdown, also limited to the first 10 contributors, was the following:
What immediately stood out from the traces above was the following:
- The two executions were pretty much identical
- The major contributor in terms of instructions executed was the Python interpreter, CPython, with PyEval_EvalFrameEx, the core function that executes Python bytecode
- _randommodule.so contributed only 3.5% of the total instructions, while all other contributions were mostly coming from CPython
Results from a relative comparison of instructions executed do not necessarily translate into the same execution time comparison. It is reasonable to expect, though, that PyEval_EvalFrameEx is also the major contributor in terms of time. For each function sampled by SDE, I created a histogram associating each instruction with its number of occurrences. Two notes, before looking into the plots:
- I restored the histograms from a previous backup and I have not preserved the raw data. Unfortunately, these were also unlabeled, so I could not assign a posteriori the correct architecture to the red and blue lines. Nevertheless, what they suggest is that the instruction set used on Ivy Bridge and Haswell is exactly the same, so I did not go any further in rebuilding the dataset from scratch.
- I could not restore the plot for _randommodule.so, so it is missing from the list
Given these results, I considered it worthwhile to have a closer look at the main contributor, PyEval_EvalFrameEx.
Analyzing PyEval_EvalFrameEx
PyEval_EvalFrameEx is the heart of the CPython interpreter, where bytecode is decoded and executed. Its implementation resides in Python/ceval.c and makes use of a huge switch statement (about 1500 lines of code) that dispatches the current opcode to the corresponding case branch for execution. In CPython 2.7.5, it looks like the following:
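The listing below is an abridged sketch of that loop, modeled on the structure of CPython 2.7's Python/ceval.c rather than quoted verbatim (declarations, error handling, and almost all opcode cases are omitted):

```c
/* Abridged sketch of the CPython 2.7 bytecode dispatch loop
 * in Python/ceval.c; not the verbatim source. */
for (;;) {
    /* ... periodic checks, tracing hooks, etc. ... */
fast_next_opcode:
    /* Fetch the next opcode and, if present, its argument. */
    opcode = NEXTOP();
    oparg = 0;
    if (HAS_ARG(opcode))
        oparg = NEXTARG();

    /* The huge switch: one case branch per bytecode instruction,
     * compiled into an indirect jump through a jump table. */
    switch (opcode) {

    case NOP:
        goto fast_next_opcode;

    case LOAD_FAST:
        x = GETLOCAL(oparg);
        if (x != NULL) {
            Py_INCREF(x);
            PUSH(x);
            goto fast_next_opcode;
        }
        /* ... error path ... */
        break;

    case BINARY_MULTIPLY:
        w = POP();
        v = TOP();
        x = PyNumber_Multiply(v, w);
        Py_DECREF(v);
        Py_DECREF(w);
        SET_TOP(x);
        if (x != NULL)
            continue;
        break;

    /* ... roughly a hundred more opcode cases ... */
    }
    /* ... stack unwinding and error handling ... */
}
```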
Switch constructs are normally implemented in assembly with jump tables, assigning a label to each branch which constitutes the target of a jump instruction when a matching argument in the switch statement is encountered. The jump is indirect: the destination address is loaded into a register or taken directly from memory. This is consistent with the profile of PyEval_EvalFrameEx, where jmp is the seventh most recurrent instruction. Now, in a further attempt to reproduce a workload that yields a 50% speedup on the Haswell architecture, I tried to write some basic code that would replicate this indirect-jmp-intensive workload.
The code can be found in the instr-miss-benchmark repository. The script implements a large switch statement dispatching in a loop either a sequential opcode sequence or a random one.
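In spirit, the core of such a benchmark looks like the sketch below; this is an illustration of the idea, not the repository's exact code, and the 512 case bodies are generated with preprocessor macros here for brevity:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SEQLEN 512        /* length of the opcode sequence */
#define ITERS  (1 << 16)  /* how many times to replay it   */

/* Macros generating all 512 dense case branches. */
#define C1(n)   case (n): acc += (n); break;
#define C4(n)   C1(n) C1((n)+1) C1((n)+2) C1((n)+3)
#define C16(n)  C4(n) C4((n)+4) C4((n)+8) C4((n)+12)
#define C64(n)  C16(n) C16((n)+16) C16((n)+32) C16((n)+48)
#define C256(n) C64(n) C64((n)+64) C64((n)+128) C64((n)+192)

int main(int argc, char **argv)
{
    int random_mode = argc > 1 && strcmp(argv[1], "random") == 0;
    int opcodes[SEQLEN];
    unsigned long acc = 0;
    int i, it;

    /* Build either a sequential or a random opcode sequence. */
    srand(1);
    for (i = 0; i < SEQLEN; i++)
        opcodes[i] = random_mode ? rand() % 512 : i;

    /* Dispatch loop: gcc compiles a dense switch like this into an
     * indirect jmp through a jump table, so every iteration
     * exercises the branch target predictor. */
    for (it = 0; it < ITERS; it++)
        for (i = 0; i < SEQLEN; i++)
            switch (opcodes[i]) {
                C256(0)
                C256(256)
            }

    printf("acc = %lu\n", acc);
    return 0;
}
```

With gcc -O2, a dense switch like this compiles to a bounds check followed by an indirect jmp through a 512-entry jump table. The results are as follows (the number of case branches is 512):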
Ivy Bridge, sequential opcode, length of the sequence is 512
Ivy Bridge, random opcode, length of the sequence is 512
Haswell, sequential opcode, length of the sequence is 512
Haswell, random opcode, length of the sequence is 512
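As an aside, branch and branch-miss counts like the ones above can be collected on Linux with perf (the binary name here is illustrative):

```
$ perf stat -e branches,branch-misses ./switch-bench random
```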
The comparison highlights that, with a negligible delta in the number of branches executed by the CPU, the number of branch mispredictions is much higher on Ivy Bridge. A mispredicted branch is a wrong guess taken by the front-end of the pipeline when deciding where to fetch the next instruction. This prediction is made when fetching an instruction block: the front-end tries to guess whether the block will branch and, if so, where it will land. Guessing the destination address is referred to as Branch Target Prediction, since we have a branching instruction (e.g. jmp) which uses an indirect operand of the form:
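For instance, in AT&T syntax (register and label names are illustrative):

```asm
jmp *%rax             # target address taken from a register
jmp *(%rbx)           # target address loaded from memory
jmp *.Ltable(,%rax,8) # jump-table dispatch: entry selected by the opcode
```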
The prediction is done very early to avoid having to wait for the decode and execution stages of the pipeline. The obvious consequence is that the guess might be wrong. In that case, the instructions that have been fetched following a wrong prediction are flushed from the pipeline and the execution resumes from the correct control path.
When using a random opcode sequence of length 512, the number of mispredicted branches is 10 times higher on Ivy Bridge, with a difference in the number of branches of only 0.2%. There is another interesting result that stands out: on Haswell, a sequential opcode sequence generates a higher number of mispredictions compared to a random one.