# perf(1)
```
perf list show supported hw/sw events & metrics
-v ........ print longer event descriptions
--details . print information on the perf event names
and expressions used internally by events
perf stat
-p <pid> ..... show stats for running process
-o <file> .... write output to file (default stderr)
-I <ms> ...... show stats periodically over interval <ms>
-e <ev> ...... select event(s)
-M <met> ..... print metric(s), this adds the metric events
--all-user ... configure all selected events for user space
--all-kernel . configure all selected events for kernel space
perf top
-p <pid> .. show stats for running process
-F <hz> ... sampling frequency
-K ........ hide kernel threads
perf record
-p <pid> ............... record stats for running process
-o <file> .............. write output to file (default perf.data)
-F <hz> ................ sampling frequency
  --call-graph <method> .. [fp, dwarf, lbr] method to capture the backtrace
fp : use frame-pointer, need to compile with
-fno-omit-frame-pointer
dwarf: use .cfi debug information
lbr : use hardware last branch record facility
-g ..................... short-hand for --call-graph fp
-e <ev> ................ select event(s)
--all-user ............. configure all selected events for user space
--all-kernel ........... configure all selected events for kernel space
-M intel ............... use intel disassembly in annotate
perf report
-n .................... annotate symbols with nr of samples
  --stdio ............... report to stdio, if not present launch tui mode
  -g graph,0.5,callee ... show callee-based call chains with value >0.5
```
```
Useful <ev>:
page-faults
minor-faults
major-faults
  cpu-cycles
task-clock
```
## Select specific events
Events to sample are specified with the `-e` option; either pass a comma-separated
list or pass `-e` multiple times.
Events are specified in the following form `name[:modifier]`. The list and
description of the `modifier` can be found in the
[`perf-list(1)`][man-perf-list] manpage under `EVENT MODIFIERS`.
```sh
# L1 i$ misses in user space
# L2 i$ stats in user/kernel space mixed
# Sample specified events.
perf stat -e L1-icache-load-misses:u \
-e l2_rqsts.all_code_rd:uk,l2_rqsts.code_rd_hit:k,l2_rqsts.code_rd_miss:k \
-- stress -c 2
```
The `--all-user` and `--all-kernel` options append a `:u` and `:k` modifier to
all specified events. Therefore the following two command lines are equivalent.
```sh
# 1)
perf stat -e cycles:u,instructions:u -- ls
# 2)
perf stat --all-user -e cycles,instructions -- ls
```
### Raw events
In case perf does not provide a _symbolic_ name for an event, the event can be
specified in its _raw_ form `r<UMask><EventCode>` (the hex bytes concatenated).
The following is an example for the [L2_RQSTS.CODE_RD_HIT][l2i-req-ev] event
with `EventCode=0x24` and `UMask=0x10` on my laptop with a `sandybridge` uarch.
```sh
perf stat -e l2_rqsts.code_rd_hit -e r1024 -- ls
# Performance counter stats for 'ls':
#
# 33.942 l2_rqsts.code_rd_hit
# 33.942 r1024
```
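The raw value is just the `UMask` and `EventCode` bytes printed as concatenated
hex digits. A tiny helper (hypothetical, not part of perf) makes the
construction explicit:

```sh
# Build a perf raw event string from UMask and EventCode.
# UMask=0x10, EventCode=0x24 -> r1024 (matches the example above).
raw_event() {
  printf 'r%02x%02x\n' "$1" "$2"
}

raw_event 0x10 0x24
```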
### Find raw performance counter events (intel)
The [`intel/perfmon`][perfmon] repository provides a performance event
database for the different intel uarchs.
The table in [`mapfile.csv`][perfmon-map] can be used to look up the
corresponding uarch; the family / model can be taken from procfs.
```sh
awk '/^vendor_id/  { V=$3 }
     /^cpu family/ { F=$4 }
     /^model\s*:/  { printf "%s-%d-%x\n",V,F,$3 }' /proc/cpuinfo
```
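The printed string can then be looked up in the mapfile (note `-i`: the awk
above prints lowercase hex, while the mapfile uses uppercase). This sketch
assumes a local checkout of the repository:

```sh
# Find the uarch event files for this cpu in intel/perfmon's mapfile.csv
# (example family-model string for a sandybridge core).
grep -i '^GenuineIntel-6-2a,' mapfile.csv
```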
> The table in [performance monitoring events][perfmon-kinds] describes how
> events are sorted into the different files.
### Raw events for perf's own symbolic names
Perf also defines some of its own _symbolic_ names for events. An example is the
`cache-references` event. The [`perf_event_open(2)`][man-perf-ev-open] manpage
gives the following description.
```man
perf_event_open(2)
PERF_COUNT_HW_CACHE_REFERENCES
Cache accesses. Usually this indicates Last Level Cache accesses but this
may vary depending on your CPU. This may include prefetches and coherency
messages; again this depends on the design of your CPU.
```
The `sysfs` can be consulted to get the concrete event encoding on the
given system.
```sh
cat /sys/devices/cpu/events/cache-references
# event=0x2e,umask=0x4f
```
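The sysfs encoding can be translated back into the raw `r<UMask><EventCode>`
form from above. A sketch assuming the two-field format shown in the example
(the `sysfs_to_raw` helper is hypothetical; some event files carry only an
`event=` field):

```sh
# Rewrite "event=0x<EC>,umask=0x<UM>" into perf's raw form r<UM><EC>.
sysfs_to_raw() {
  sed -E 's/^event=0x([0-9a-fA-F]+),umask=0x([0-9a-fA-F]+).*/r\2\1/'
}

echo 'event=0x2e,umask=0x4f' | sysfs_to_raw   # -> r4f2e
```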
## [`Flamegraph`](https://github.com/brendangregg/FlameGraph)
### Flamegraph with single event trace
```sh
perf record -g -e cpu-cycles -p <pid>
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > cycles-flamegraph.svg
```
### Flamegraph with multiple event traces
```sh
perf record -g -e cpu-cycles,page-faults -p <pid>
perf script --per-event-dump
# fold & generate as above
```
## Examples
### Estimate max instructions per cycle
```c
{{#include src/noploop.c }}
```
```sh
perf stat -e cycles,instructions ./noploop
# Performance counter stats for './noploop':
#
# 1.031.075.940 cycles
# 4.103.534.341 instructions # 3,98 insn per cycle
```
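The `insn per cycle` value is simply the ratio of the two counters;
double-checking with the numbers from the run above:

```sh
# instructions / cycles from the perf stat output above.
awk 'BEGIN { printf "%.2f insn per cycle\n", 4103534341 / 1031075940 }'
# 3.98 insn per cycle
```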
### Caller vs callee callstacks
The following gives an example for a scenario with these calls:
- `main -> do_foo() -> do_work()`
- `main -> do_bar() -> do_work()`
```sh
perf report --stdio -g graph,caller
# Children Self Command Shared Object Symbols
# ........ ........ ....... .................... .................
#
# 49.71% 49.66% bench bench [.] do_work
# |
# --49.66%--_start <- callstack bottom
# __libc_start_main
# 0x7ff366c62ccf
# main
# |
# |--25.13%--do_bar
# | do_work <- callstack top
# |
# --24.53%--do_foo
# do_work
perf report --stdio -g graph,callee
# Children Self Command Shared Object Symbols
# ........ ........ ....... .................... .................
#
# 49.71% 49.66% bench bench [.] do_work
# |
# ---do_work <- callstack top
# |
# |--25.15%--do_bar
# | main
# | 0x7ff366c62ccf
# | __libc_start_main
# | _start <- callstack bottom
# |
# --24.55%--do_foo
# main
# 0x7ff366c62ccf
# __libc_start_main
# _start <- callstack bottom
```
## References
- [intel/perfmon][perfmon] - intel PMU event database per uarch
- [intel/perfmon-html][perfmon-html] - an html-rendered version of the PMU
  events with search
- [intel/perfmon/mapfile.csv][perfmon-map] - processor family to uarch mapping
- [linux/perf/events][perf-pmu-ev] - x86 PMU events known to perf tools
- [linux/arch/events][x86-core-ev] - x86 PMU events linux kernel
- [wikichip] - computer architecture wiki
- [perf-list(1)][man-perf-list] - manpage
- [perf_event_open(2)][man-perf-ev-open] - manpage
- [intel/sdm][intel-sdm] - intel software developer manuals (eg Optimization
Reference Manual)
[perfmon-html]: https://perfmon-events.intel.com/
[perfmon]: https://github.com/intel/perfmon
[perfmon-map]: https://github.com/intel/perfmon/blob/main/mapfile.csv
[perfmon-kinds]: https://github.com/intel/perfmon/tree/main#performance-monitoring-events
[intel-sdm]: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
[perf-pmu-ev]: https://github.com/torvalds/linux/tree/master/tools/perf/pmu-events/arch/x86
[x86-core-ev]: https://github.com/torvalds/linux/blob/master/arch/x86/events/intel/core.c
[l2i-req-ev]: https://github.com/intel/perfmon/blob/09c155f72e1b8f14b09aea346a35467a03a7d62b/SNB/events/sandybridge_core.json#L808
[man-perf-ev-open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
[man-perf-list]: https://man7.org/linux/man-pages/man1/perf-list.1.html
[wikichip]: https://en.wikichip.org/wiki/WikiChip