Diffstat (limited to 'src')
-rw-r--r-- src/SUMMARY.md 1
-rw-r--r-- src/development/README.md 1
-rw-r--r-- src/development/pgo.md 158
3 files changed, 160 insertions, 0 deletions
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index fb22e69..b6da0b7 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -50,6 +50,7 @@
- [symbol versioning](./development/symbolver.md)
- [python](./development/python.md)
- [gcov](./development/gcov.md)
+ - [pgo](./development/pgo.md)
- [Linux](./linux/README.md)
- [systemd](./linux/systemd.md)
diff --git a/src/development/README.md b/src/development/README.md
index dfea5cd..117ef17 100644
--- a/src/development/README.md
+++ b/src/development/README.md
@@ -9,3 +9,4 @@
- [symbol versioning](./symbolver.md)
- [python](./python.md)
- [gcov](./gcov.md)
+- [pgo](./pgo.md)
diff --git a/src/development/pgo.md b/src/development/pgo.md
new file mode 100644
index 0000000..d1b3eb7
--- /dev/null
+++ b/src/development/pgo.md
@@ -0,0 +1,158 @@
+# Profile guided optimization (pgo)
+
+Profile guided optimization (`pgo`) is a technique that uses profiling data
+collected from actual runs to optimize a program for its usual workload.
+
+It is applied in two phases:
+1. Collect profiling data (best with representative benchmarks).
+1. Optimize program based on collected profiling data.
+
+The following simple program (`test.cc`) is used as a demonstrator.
+```c
+#include <stdio.h>
+
+#define NOINLINE __attribute__((noinline))
+
+NOINLINE void foo() { puts("foo()"); }
+NOINLINE void bar() { puts("bar()"); }
+
+int main(int argc, char *argv[]) {
+  if (argc == 2) {
+    foo();
+  } else {
+    bar();
+  }
+}
+```
+
+## clang
+
+With `clang 15.0.7` on the current machine, the following code is generated
+for the `main()` function.
+```x86asm
+# clang -o test test.cc -O3
+
+0000000000001160 <main>:
+ 1160: 50 push rax
+ ; Jump if argc != 2.
+ 1161: 83 ff 02 cmp edi,0x2
+ 1164: 75 09 jne 116f <main+0xf>
+ ; foo() is on the hot path (fall-through).
+ 1166: e8 d5 ff ff ff call 1140 <_Z3foov>
+ 116b: 31 c0 xor eax,eax
+ 116d: 59 pop rcx
+ 116e: c3 ret
+ ; bar() is on the cold path (branch).
+ 116f: e8 dc ff ff ff call 1150 <_Z3barv>
+ 1174: 31 c0 xor eax,eax
+ 1176: 59 pop rcx
+ 1177: c3 ret
+```
+
+The following shows how to compile with profiling instrumentation and how to
+optimize the final program with the collected profiling data ([llvm
+pgo][llvm-pgo]).
+
+The arguments to `./test` are chosen such that `10/11` runs call `bar()`, which
+is currently on the `cold path`.
+
+```bash
+# Compile test program with profiling instrumentation.
+clang -o test test.cc -O3 -fprofile-instr-generate
+
+# Collect profiling data from multiple runs.
+for i in {0..10}; do
+ LLVM_PROFILE_FILE="prof.clang/%p.profraw" ./test $(seq 0 $i)
+done
+
+# Merge raw profiling data into single profile data.
+llvm-profdata merge -o pgo.profdata prof.clang/*.profraw
+
+# Optimize test program with profiling data.
+clang -o test test.cc -O3 -fprofile-use=pgo.profdata
+```
+> NOTE: If `LLVM_PROFILE_FILE` is not given, the profile data is written to
+> `default.profraw`, which is overwritten on each run. If the
+> `LLVM_PROFILE_FILE` filename contains a `%m`, a unique integer is generated
+> and consecutive runs update the same generated profraw file. Alternatively,
+> `LLVM_PROFILE_FILE` can specify a new file for every run (for example with
+> `%p`, as above), which generally requires more storage.
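+
+The split of the runs between the two branches can be checked without the
+instrumented binary. The following is a minimal sketch that replays the
+argument lists of the loop above and counts which branch of `main()` each run
+would take. The variable names `foo_runs` and `bar_runs` are made up for this
+illustration, and `$(seq 0 10)` stands in for `{0..10}` so that the snippet
+also runs in a plain POSIX shell.
+
+```bash
+# Dry run of the profiling loop: count the runs per branch of main().
+foo_runs=0
+bar_runs=0
+for i in $(seq 0 10); do
+    # Same argument list as passed to ./test in the profiling loop.
+    set -- $(seq 0 "$i")
+    # argc also counts the program name (argv[0]).
+    argc=$(($# + 1))
+    if [ "$argc" -eq 2 ]; then
+        foo_runs=$((foo_runs + 1))
+    else
+        bar_runs=$((bar_runs + 1))
+    fi
+done
+# Prints "foo: 1, bar: 10".
+echo "foo: $foo_runs, bar: $bar_runs"
+```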
+
+After optimizing the program with the profiling data, the `main()` function
+looks as follows.
+```x86asm
+0000000000001060 <main>:
+ 1060: 50 push rax
+ ; Jump if argc == 2.
+ 1061: 83 ff 02 cmp edi,0x2
+ 1064: 74 09 je 106f <main+0xf>
+ ; bar() is on the hot path (fall-through).
+ 1066: e8 e5 ff ff ff call 1050 <_Z3barv>
+ 106b: 31 c0 xor eax,eax
+ 106d: 59 pop rcx
+ 106e: c3 ret
+ ; foo() is on the cold path (branch).
+ 106f: e8 cc ff ff ff call 1040 <_Z3foov>
+ 1074: 31 c0 xor eax,eax
+ 1076: 59 pop rcx
+ 1077: c3 ret
+```
+
+## gcc
+
+With `gcc 13.2.1` on the current machine, the optimizer puts `bar()` on the
+`hot path` by default.
+```x86asm
+0000000000001040 <main>:
+ 1040: 48 83 ec 08 sub rsp,0x8
+ ; Jump if argc == 2.
+ 1044: 83 ff 02 cmp edi,0x2
+ 1047: 74 0c je 1055 <main+0x15>
+ ; bar() is on the hot path (fall-through).
+ 1049: e8 22 01 00 00 call 1170 <_Z3barv>
+ 104e: 31 c0 xor eax,eax
+ 1050: 48 83 c4 08 add rsp,0x8
+ 1054: c3 ret
+ ; foo() is on the cold path (branch).
+ 1055: e8 06 01 00 00 call 1160 <_Z3foov>
+ 105a: eb f2 jmp 104e <main+0xe>
+ 105c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
+```
+
+The following shows how to compile with profiling instrumentation and how to
+optimize the final program with the collected profiling data.
+
+The arguments to `./test` are chosen such that `2/3` runs call `foo()`, which
+is currently on the `cold path`.
+
+```bash
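+# Compile test program with profiling instrumentation.
+gcc -o test test.cc -O3 -fprofile-generate
+
+# Collect profiling data from multiple runs.
+./test 1
+./test 1
+./test 2 2
+
+# Optimize test program with profiling data.
+gcc -o test test.cc -O3 -fprofile-use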
+```
+> NOTE: Consecutive runs update the generated `test.gcda` profile data file
+> rather than overwriting it.
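+
+Because the counters accumulate across runs, data from an earlier experiment
+stays in the profile until the file is removed. The following is a minimal
+sketch for starting from a clean state. The helper name `clean_profile` is
+hypothetical, and it assumes that the `.gcda` files end up below the given
+directory (the current directory if none is given).
+
+```bash
+# Remove stale gcc profile data below a directory (default: current directory).
+clean_profile() {
+    find "${1:-.}" -name '*.gcda' -type f -delete
+}
+
+# Usage: call it once before the first run of the instrumented binary, e.g.
+#   clean_profile .
+```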
+
+After optimizing the program with the profiling data, the `main()` function
+looks as follows.
+```x86asm
+0000000000001040 <main.cold>:
+ ; bar() is on the cold path (branch).
+ 1040: e8 05 00 00 00 call 104a <_Z3barv>
+ 1045: e9 25 00 00 00 jmp 106f <main+0xf>
+
+0000000000001060 <main>:
+ 1060: 51 push rcx
+ ; Jump if argc != 2.
+ 1061: 83 ff 02 cmp edi,0x2
+ 1064: 0f 85 d6 ff ff ff jne 1040 <main.cold>
+ ; foo() is on the hot path (fall-through).
+ 106a: e8 11 01 00 00 call 1180 <_Z3foov>
+ 106f: 31 c0 xor eax,eax
+ 1071: 5a pop rdx
+ 1072: c3 ret
+```
+
+[llvm-pgo]: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization