author | Johannes Stoelp <johannes.stoelp@gmail.com> | 2023-08-19 00:33:57 +0200 |
---|---|---|
committer | Johannes Stoelp <johannes.stoelp@gmail.com> | 2023-08-22 23:37:33 +0200 |
commit | fbbae01e8ccd29821a6b476cccb10220a7ed21ab (patch) | |
tree | c59dffab4d78768e545e5a37ae47752903b17cca /src/development | |
parent | e7b2cecc90ef105091de249698d8c61e50a8764b (diff) | |
add pgo notes
Diffstat (limited to 'src/development')
-rw-r--r-- | src/development/README.md | 1 |
-rw-r--r-- | src/development/pgo.md | 158 |
2 files changed, 159 insertions, 0 deletions
diff --git a/src/development/README.md b/src/development/README.md
index dfea5cd..117ef17 100644
--- a/src/development/README.md
+++ b/src/development/README.md
@@ -9,3 +9,4 @@
 - [symbol versioning](./symbolver.md)
 - [python](./python.md)
 - [gcov](./gcov.md)
+- [pgo](./pgo.md)
diff --git a/src/development/pgo.md b/src/development/pgo.md
new file mode 100644
index 0000000..d1b3eb7
--- /dev/null
+++ b/src/development/pgo.md
@@ -0,0 +1,158 @@
+# Profile guided optimization (pgo)
+
+`pgo` is an optimization technique that optimizes a program for its typical
+workload.
+
+It is applied in two phases:
+1. Collect profiling data (best with representative benchmarks).
+1. Optimize the program based on the collected profiling data.
+
+The following simple program is used as a demonstrator.
+```c
+#include <stdio.h>
+
+#define NOINLINE __attribute__((noinline))
+
+NOINLINE void foo() { puts("foo()"); }
+NOINLINE void bar() { puts("bar()"); }
+
+int main(int argc, char *argv[]) {
+  if (argc == 2) {
+    foo();
+  } else {
+    bar();
+  }
+}
+```
+
+## clang
+
+On the current machine with `clang 15.0.7`, the following code is generated for
+the `main()` function.
+```x86asm
+# clang -o test test.cc -O3
+
+0000000000001160 <main>:
+  1160:  50                 push   rax
+  ; Jump if argc != 2.
+  1161:  83 ff 02           cmp    edi,0x2
+  1164:  75 09              jne    116f <main+0xf>
+  ; foo() is on the hot path (fall-through).
+  1166:  e8 d5 ff ff ff     call   1140 <_Z3foov>
+  116b:  31 c0              xor    eax,eax
+  116d:  59                 pop    rcx
+  116e:  c3                 ret
+  ; bar() is on the cold path (branch).
+  116f:  e8 dc ff ff ff     call   1150 <_Z3barv>
+  1174:  31 c0              xor    eax,eax
+  1176:  59                 pop    rcx
+  1177:  c3                 ret
+```
+
+The following shows how to compile with profiling instrumentation and how to
+optimize the final program with the collected profiling data ([llvm
+pgo][llvm-pgo]).
+
+The arguments to `./test` are chosen such that `10/11` runs call `bar()`, which
+is currently on the `cold path`.
+
+```bash
+# Compile test program with profiling instrumentation.
+clang -o test test.cc -O3 -fprofile-instr-generate
+
+# Collect profiling data from multiple runs.
+for i in {0..10}; do
+  LLVM_PROFILE_FILE="prof.clang/%p.profraw" ./test $(seq 0 $i)
+done
+
+# Merge raw profiling data into single profile data.
+llvm-profdata merge -o pgo.profdata prof.clang/*.profraw
+
+# Optimize test program with profiling data.
+clang -o test test.cc -O3 -fprofile-use=pgo.profdata
+```
+> NOTE: If `LLVM_PROFILE_FILE` is not given, the profile data is written to
+> `default.profraw`, which is re-written on each run. If `LLVM_PROFILE_FILE`
+> contains a `%m` in the filename, a unique integer is generated and
+> consecutive runs update the same generated profraw file. Alternatively,
+> `LLVM_PROFILE_FILE` can name a new file on every run (e.g. with `%p`, as
+> above), which generally requires more storage.
+
+After optimizing the program with the profiling data, the `main()` function
+looks as follows.
+```x86asm
+0000000000001060 <main>:
+  1060:  50                 push   rax
+  ; Jump if argc == 2.
+  1061:  83 ff 02           cmp    edi,0x2
+  1064:  74 09              je     106f <main+0xf>
+  ; bar() is on the hot path (fall-through).
+  1066:  e8 e5 ff ff ff     call   1050 <_Z3barv>
+  106b:  31 c0              xor    eax,eax
+  106d:  59                 pop    rcx
+  106e:  c3                 ret
+  ; foo() is on the cold path (branch).
+  106f:  e8 cc ff ff ff     call   1040 <_Z3foov>
+  1074:  31 c0              xor    eax,eax
+  1076:  59                 pop    rcx
+  1077:  c3                 ret
+```
+
+## gcc
+
+With `gcc 13.2.1` on the current machine, the optimizer puts `bar()` on the
+`hot path` by default.
+```x86asm
+0000000000001040 <main>:
+  1040:  48 83 ec 08        sub    rsp,0x8
+  ; Jump if argc == 2.
+  1044:  83 ff 02           cmp    edi,0x2
+  1047:  74 0c              je     1055 <main+0x15>
+  ; bar() is on the hot path (fall-through).
+  1049:  e8 22 01 00 00     call   1170 <_Z3barv>
+  104e:  31 c0              xor    eax,eax
+  1050:  48 83 c4 08        add    rsp,0x8
+  1054:  c3                 ret
+  ; foo() is on the cold path (branch).
+  1055:  e8 06 01 00 00     call   1160 <_Z3foov>
+  105a:  eb f2              jmp    104e <main+0xe>
+  105c:  0f 1f 40 00        nop    DWORD PTR [rax+0x0]
+
+```
+
+The following shows how to compile with profiling instrumentation and how to
+optimize the final program with the collected profiling data.
+
+The arguments to `./test` are chosen such that `2/3` runs call `foo()`, which
+is currently on the `cold path`.
+
+```bash
+gcc -o test test.cc -O3 -fprofile-generate
+./test 1
+./test 1
+./test 2 2
+gcc -o test test.cc -O3 -fprofile-use
+```
+> NOTE: Consecutive runs update the generated `test.gcda` profile data file
+> rather than re-write it.
+
+After optimizing the program with the profiling data, `main()` looks as follows.
+```x86asm
+0000000000001040 <main.cold>:
+  ; bar() is on the cold path (branch).
+  1040:  e8 05 00 00 00     call   104a <_Z3barv>
+  1045:  e9 25 00 00 00     jmp    106f <main+0xf>
+
+0000000000001060 <main>:
+  1060:  51                 push   rcx
+  ; Jump if argc != 2.
+  1061:  83 ff 02           cmp    edi,0x2
+  1064:  0f 85 d6 ff ff ff  jne    1040 <main.cold>
+  ; foo() is on the hot path (fall-through).
+  106a:  e8 11 01 00 00     call   1180 <_Z3foov>
+  106f:  31 c0              xor    eax,eax
+  1071:  5a                 pop    rdx
+  1072:  c3                 ret
+```
+
+[llvm-pgo]: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
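
The workload split used for the clang profiling runs can be checked without building anything. The sketch below is not part of the commit. It replays the argument lists from the `for i in {0..10}` loop and tallies which branch of `main()` each run would take, assuming `argc` is the number of arguments plus one for the program name. The function name `count_branches` is a hypothetical helper introduced only for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the notes): tally which branch of main()
# each profiling run takes for `./test $(seq 0 $i)` with i in 0..10.
count_branches() {
  local foo=0 bar=0 i argc
  for i in {0..10}; do
    # Use the same argument list the profiling loop passes to ./test.
    set -- $(seq 0 "$i")
    # argc is the argument count plus one for the program name.
    argc=$(( $# + 1 ))
    if [ "$argc" -eq 2 ]; then
      foo=$(( foo + 1 ))   # argc == 2 -> foo()
    else
      bar=$(( bar + 1 ))   # otherwise -> bar()
    fi
  done
  echo "foo=$foo bar=$bar"
}

count_branches
```

The script prints `foo=1 bar=10`: only the first iteration (`i=0`, a single argument) reaches `foo()`, so `bar()` dominates the collected profile.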