# Profile guided optimization (pgo)
`pgo` is an optimization technique that tunes a program for its typical
workload, using profiling data collected while the program runs.
It is applied in two phases:
1. Collect profiling data (best with representative benchmarks).
1. Optimize program based on collected profiling data.
The following simple program is used as a demonstrator.
```c
#include <stdio.h>
#define NOINLINE __attribute__((noinline))
NOINLINE void foo() { puts("foo()"); }
NOINLINE void bar() { puts("bar()"); }
int main(int argc, char *argv[]) {
    if (argc == 2) {
        foo();
    } else {
        bar();
    }
}
```
## clang
On the current machine with `clang 15.0.7`, the following code is generated for
the `main()` function.
```x86asm
# clang -o test test.cc -O3
0000000000001160 <main>:
1160: 50 push rax
; Jump if argc != 2.
1161: 83 ff 02 cmp edi,0x2
1164: 75 09 jne 116f <main+0xf>
; foo() is on the hot path (fall-through).
1166: e8 d5 ff ff ff call 1140 <_Z3foov>
116b: 31 c0 xor eax,eax
116d: 59 pop rcx
116e: c3 ret
; bar() is on the cold path (branch).
116f: e8 dc ff ff ff call 1150 <_Z3barv>
1174: 31 c0 xor eax,eax
1176: 59 pop rcx
1177: c3 ret
```
The following shows how to compile with profiling instrumentation and how to
optimize the final program with the collected profiling data ([llvm
pgo][llvm-pgo]).
The arguments to `./test` are chosen such that `10/11` runs call `bar()`, which
is currently on the `cold path`.
```bash
# Compile test program with profiling instrumentation.
clang -o test test.cc -O3 -fprofile-instr-generate
# Collect profiling data from multiple runs.
for i in {0..10}; do
    LLVM_PROFILE_FILE="prof.clang/%p.profraw" ./test $(seq 0 $i)
done
# Merge raw profiling data into single profile data.
llvm-profdata merge -o pgo.profdata prof.clang/*.profraw
# Optimize test program with profiling data.
clang -o test test.cc -O3 -fprofile-use=pgo.profdata
```
> NOTE: If `LLVM_PROFILE_FILE` is not given, the profile data is written to
> `default.profraw`, which is overwritten on each run. If `LLVM_PROFILE_FILE`
> contains a `%m` in the filename, a unique integer is generated and
> consecutive runs update the same generated profraw file. Alternatively,
> `LLVM_PROFILE_FILE` can specify a new file for every run (as with `%p` in
> the example above), but that generally requires more storage.
After optimizing the program with the profiling data, the `main()` function
looks as follows.
```x86asm
0000000000001060 <main>:
1060: 50 push rax
; Jump if argc == 2.
1061: 83 ff 02 cmp edi,0x2
1064: 74 09 je 106f <main+0xf>
; bar() is on the hot path (fall-through).
1066: e8 e5 ff ff ff call 1050 <_Z3barv>
106b: 31 c0 xor eax,eax
106d: 59 pop rcx
106e: c3 ret
; foo() is on the cold path (branch).
106f: e8 cc ff ff ff call 1040 <_Z3foov>
1074: 31 c0 xor eax,eax
1076: 59 pop rcx
1077: c3 ret
```
## gcc
With `gcc 13.2.1` on the current machine, the optimizer puts `bar()` on the
`hot path` by default.
```x86asm
0000000000001040 <main>:
1040: 48 83 ec 08 sub rsp,0x8
; Jump if argc == 2.
1044: 83 ff 02 cmp edi,0x2
1047: 74 0c je 1055 <main+0x15>
; bar() is on the hot path (fall-through).
1049: e8 22 01 00 00 call 1170 <_Z3barv>
104e: 31 c0 xor eax,eax
1050: 48 83 c4 08 add rsp,0x8
1054: c3 ret
; foo() is on the cold path (branch).
1055: e8 06 01 00 00 call 1160 <_Z3foov>
105a: eb f2 jmp 104e <main+0xe>
105c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
```
The following shows how to compile with profiling instrumentation and how to
optimize the final program with the collected profiling data.
The arguments to `./test` are chosen such that `2/3` runs call `foo()`, which
is currently on the `cold path`.
```bash
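# Compile test program with profiling instrumentation.
gcc -o test test.cc -O3 -fprofile-generate
# Collect profiling data from multiple runs.
./test 1
./test 1
./test 2 2
# Optimize test program with profiling data.
gcc -o test test.cc -O3 -fprofile-use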
```
> NOTE: Consecutive runs update the generated `test.gcda` profile data file
> rather than re-write it.
After optimizing the program with the profiling data, the `main()` function
looks as follows.
```x86asm
0000000000001040 <main.cold>:
; bar() is on the cold path (branch).
1040: e8 05 00 00 00 call 104a <_Z3barv>
1045: e9 25 00 00 00 jmp 106f <main+0xf>
0000000000001060 <main>:
1060: 51 push rcx
; Jump if argc != 2.
1061: 83 ff 02 cmp edi,0x2
1064: 0f 85 d6 ff ff ff jne 1040 <main.cold>
; foo() is on the hot path (fall-through).
106a: e8 11 01 00 00 call 1180 <_Z3foov>
106f: 31 c0 xor eax,eax
1071: 5a pop rdx
1072: c3 ret
```
[llvm-pgo]: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization