# Profile guided optimization (pgo)

`pgo` is an optimization technique that tunes a program for its typical
workload.

It is applied in two phases:
1. Collect profiling data (best with representative benchmarks).
1. Optimize the program based on the collected profiling data.

The following simple program is used as a demonstrator. It is stored as
`test.cc` and therefore compiled as C++, which is why the symbols in the
disassembly below are mangled (e.g. `_Z3foov`).

```c
#include <stdio.h>

#define NOINLINE __attribute__((noinline))

NOINLINE void foo() { puts("foo()"); }
NOINLINE void bar() { puts("bar()"); }

int main(int argc, char *argv[]) {
  if (argc == 2) {
    foo();
  } else {
    bar();
  }
}
```

## clang

On the current machine with `clang 15.0.7`, the following code is generated for
the `main()` function.

```x86asm
# clang -o test test.cc -O3

0000000000001160 <main>:
    1160:  50              push   rax
    ; Jump if argc != 2.
    1161:  83 ff 02        cmp    edi,0x2
    1164:  75 09           jne    116f
    ; foo() is on the hot path (fall-through).
    1166:  e8 d5 ff ff ff  call   1140 <_Z3foov>
    116b:  31 c0           xor    eax,eax
    116d:  59              pop    rcx
    116e:  c3              ret
    ; bar() is on the cold path (branch).
    116f:  e8 dc ff ff ff  call   1150 <_Z3barv>
    1174:  31 c0           xor    eax,eax
    1176:  59              pop    rcx
    1177:  c3              ret
```

The following shows how to compile with profiling instrumentation and how to
optimize the final program with the collected profiling data ([llvm
pgo][llvm-pgo]).

The arguments to `./test` are chosen such that 10 of the 11 runs call `bar()`,
which is currently on the `cold path`. Only the first run (`./test 0`) has
`argc == 2` and calls `foo()`.

```bash
# Compile test program with profiling instrumentation.
clang -o test test.cc -O3 -fprofile-instr-generate

# Collect profiling data from multiple runs.
for i in {0..10}; do
    LLVM_PROFILE_FILE="prof.clang/%p.profraw" ./test $(seq 0 $i)
done

# Merge raw profiling data into single profile data.
llvm-profdata merge -o pgo.profdata prof.clang/*.profraw

# Optimize test program with profiling data.
clang -o test test.cc -O3 -fprofile-use=pgo.profdata
```

> NOTE: If `LLVM_PROFILE_FILE` is not given, the profile data is written to
> `default.profraw`, which is re-written on each run. If `LLVM_PROFILE_FILE`
> contains a `%m` in the filename, a unique integer is generated and
> consecutive runs update the same generated profraw file. Alternatively,
> `LLVM_PROFILE_FILE` can specify a new file for every run (as with `%p`
> above), but that generally requires more storage.

After optimizing the program with the profiling data, the `main()` function
looks as follows.

```x86asm
0000000000001060 <main>:
    1060:  50              push   rax
    ; Jump if argc == 2.
    1061:  83 ff 02        cmp    edi,0x2
    1064:  74 09           je     106f
    ; bar() is on the hot path (fall-through).
    1066:  e8 e5 ff ff ff  call   1050 <_Z3barv>
    106b:  31 c0           xor    eax,eax
    106d:  59              pop    rcx
    106e:  c3              ret
    ; foo() is on the cold path (branch).
    106f:  e8 cc ff ff ff  call   1040 <_Z3foov>
    1074:  31 c0           xor    eax,eax
    1076:  59              pop    rcx
    1077:  c3              ret
```

## gcc

With `gcc 13.2.1` on the current machine, the optimizer puts `bar()` on the
`hot path` by default.

```x86asm
0000000000001040 <main>:
    1040:  48 83 ec 08     sub    rsp,0x8
    ; Jump if argc == 2.
    1044:  83 ff 02        cmp    edi,0x2
    1047:  74 0c           je     1055
    ; bar() is on the hot path (fall-through).
    1049:  e8 22 01 00 00  call   1170 <_Z3barv>
    104e:  31 c0           xor    eax,eax
    1050:  48 83 c4 08     add    rsp,0x8
    1054:  c3              ret
    ; foo() is on the cold path (branch).
    1055:  e8 06 01 00 00  call   1160 <_Z3foov>
    105a:  eb f2           jmp    104e
    105c:  0f 1f 40 00     nop    DWORD PTR [rax+0x0]
```

The following shows how to compile with profiling instrumentation and how to
optimize the final program with the collected profiling data.

The arguments to `./test` are chosen such that `2/3` runs call `foo()`, which
is currently on the `cold path`.

```bash
gcc -o test test.cc -O3 -fprofile-generate
./test 1
./test 1
./test 2 2
gcc -o test test.cc -O3 -fprofile-use
```

> NOTE: Consecutive runs update the generated `test.gcda` profile data file
> rather than re-write it.

After optimizing the program with the profiling data, the `main()` function
looks as follows. The cold branch is moved out of `main()` into a separate
block.

```x86asm
0000000000001040 <main.cold>:
    ; bar() is on the cold path (branch).
    1040:  e8 05 00 00 00     call   104a <_Z3barv>
    1045:  e9 25 00 00 00     jmp    106f

0000000000001060 <main>:
    1060:  51                 push   rcx
    ; Jump if argc != 2.
    1061:  83 ff 02           cmp    edi,0x2
    1064:  0f 85 d6 ff ff ff  jne    1040
    ; foo() is on the hot path (fall-through).
    106a:  e8 11 01 00 00     call   1180 <_Z3foov>
    106f:  31 c0              xor    eax,eax
    1071:  5a                 pop    rdx
    1072:  c3                 ret
```

[llvm-pgo]: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
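
## Appendix: checking the run split

Before collecting a profile it can help to check which branch each run will exercise. The sketch below replays the argument pattern of the clang profiling loop (`./test $(seq 0 $i)`) without running the binary and counts how many runs would take each branch. The variable names `foo_runs` and `bar_runs` are illustrative, not part of any tool, and the loop uses `seq 0 10` in place of `{0..10}` so that it also runs in a plain POSIX shell.

```bash
# Replay the arguments of the profiling loop without running ./test.
# argc is the number of arguments plus one for the program name, and
# main() calls foo() only when argc == 2.
foo_runs=0
bar_runs=0

for i in $(seq 0 10); do
    # Same argument list as `./test $(seq 0 $i)`.
    set -- $(seq 0 "$i")
    argc=$(($# + 1))

    if [ "$argc" -eq 2 ]; then
        foo_runs=$((foo_runs + 1))
    else
        bar_runs=$((bar_runs + 1))
    fi
done

echo "foo: $foo_runs, bar: $bar_runs"
```

Only the first iteration passes a single argument, so the loop reports one `foo()` run and ten `bar()` runs.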