# awk(1)

```markdown
awk [opt] program [input]
    -F <sepstr>        field separator string (can be regex)
    program            awk program
    input              file, or stdin if no file is given
```

## Input processing

Input is processed in two stages:
1. Splitting the input into a sequence of `records`.
   By default records are split at the `newline` character, but this can be
   changed via the builtin variable `RS`.
2. Splitting each `record` into `fields`. By default fields are separated by
   runs of `whitespace`, but this can be changed via the builtin variable `FS`
   or the command line option `-F`.

Fields are accessed as follows:
- `$0` whole `record`
- `$1` field one
- `$2` field two
- ...
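
The two splitting stages can be sketched with a few one-liners. The input
strings are made up for illustration, and the last command sets `RS` and `FS`
in a `BEGIN` action (see the special patterns section) so they take effect
before the first record is read.

```bash
# Default: records split at newline, fields at whitespace.
printf 'one two\nthree four\n' | awk '{ print $2 }'
# two
# four

# -F changes the field separator (here ":").
printf 'root:x:0\nbin:x:1\n' | awk -F: '{ print $1, $3 }'
# root 0
# bin 1

# RS and FS set inside the program: records end at ";", fields at ",".
printf 'a,b;c,d' | awk 'BEGIN { RS = ";"; FS = "," } { print $2 }'
# b
# d
```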

## Program

An `awk` program is composed of pairs of the form:
```markdown
pattern { action }
```
The program is run against each `record` in the input stream. If a `pattern`
matches a `record` the corresponding `action` is executed and can access the
`fields`.

```markdown
INPUT
  |
  v
record ----> ∀ pattern matched
  |                   |
  v                   v
fields ----> run associated action
```

Any valid awk `expr` can be a `pattern`.
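
As a small sketch of different pattern kinds, the program below combines a
regex pattern and a comparison expression. The sample input is invented for
illustration. Every pattern is tested against every record, so one record can
trigger several actions.

```bash
printf 'alpha 1\nbeta 20\ngamma 3\n' | awk '
    /^b/   { print "regex matched:", $1 }
    $2 > 2 { print "expression matched:", $1 }'
# regex matched: beta
# expression matched: beta
# expression matched: gamma
```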

### Special pattern

awk provides two special patterns, `BEGIN` and `END`, which can be used
multiple times. Actions with those patterns are **executed exactly once**.
- `BEGIN` actions are run before processing the first record
- `END` actions are run after processing the last record
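
A minimal sketch of how `BEGIN` and `END` frame the per-record actions, using
made-up numeric input: the `BEGIN` action prints a header, the pattern-less
action accumulates a sum for every record, and the `END` action reports the
result.

```bash
printf '3\n4\n5\n' | awk '
    BEGIN { print "start" }
          { sum += $1 }
    END   { print "sum:", sum, "records:", NR }'
# start
# sum: 12 records: 3
```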

### Special variables

- `RS` _record separator_: first char is the record separator, by default
  `<newline>`
- `FS` _field separator_: regex to split records into fields, by default
  `<space>`, which splits at runs of whitespace
- `NR` _number record_: number of the current record
- `NF` _number fields_: number of fields in the current record
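
For illustration, the following one-liner prints `NR`, `NF` and the last field
for each record of some invented input.

```bash
printf 'a b c\nd e\n' | awk '{ print NR, NF, $NF }'
# 1 3 c
# 2 2 e
```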

### Special statements & functions

- `printf "fmt", args...`

  Print format string, args are comma separated.
  - `%s` string
  - `%d` decimal
  - `%x` hex
  - `%f` float

  Width can be specified as `%Ns`, which reserves at least `N` chars for a
  string. For floats one can use `%N.Mf`, where `N` is the total width
  (including the `.` and the fractional digits) and `M` is the number of
  digits after the `.`.

- `sprintf("fmt", expr, ...)`

  Format the expressions according to the format string. Similar to `printf`,
  but this is a function and its return value can be assigned to a variable.

- `strftime("fmt")`

  Return the current time formatted by `fmt` (a gawk extension, not part of
  POSIX awk).
  - `%Y` full year (e.g. 2020)
  - `%m` month (01-12)
  - `%d` day (01-31)
  - `%F` alias for `%Y-%m-%d`
  - `%H` hour (00-23)
  - `%M` minute (00-59)
  - `%S` second (00-59)
  - `%T` alias for `%H:%M:%S`
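
A short sketch of the `printf` conversions and width specifiers, and of
`sprintf` returning the formatted string instead of printing it. The values
are arbitrary, and the brackets only make the padding visible. `strftime` is
left out because it is not available in every awk implementation.

```bash
# Right-aligned, left-aligned, zero-padded decimal, hex, float with width.
awk 'BEGIN { printf "[%5s] [%-5s] [%03d] [%x] [%6.2f]\n", "ab", "ab", 7, 255, 3.14159 }'
# [   ab] [ab   ] [007] [ff] [  3.14]

# sprintf returns the string, which can be stored and processed further.
awk 'BEGIN { s = sprintf("%-4s|%6.2f", "pi", 3.14159); print s; print length(s) }'
# pi  |  3.14
# 11
```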


## Examples

### Filter records
```bash
awk 'NR%2 == 0 { print $0 }' <file>
```
The pattern `NR%2 == 0` matches every second record and the action `{ print $0 }`
prints the whole record.

### Access last fields in records
```bash
echo 'a b c d e f' | awk '{ print $NF $(NF-1) }'
```
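Access the last fields by doing arithmetic on `NF`, the number of fields.
Since the two expressions are not separated by a comma, they are concatenated
and this prints `fe`. Use `print $NF, $(NF-1)` to get `f e`.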

### Capture in variables
```bash
# /proc/<pid>/status
#   Name:    cat
#   ...
#   VmRSS:   516 kB
#   ...

for f in /proc/*/status; do
    # Pass the file to awk directly (no cat needed) and quote the path.
    awk '
        /^VmRSS/ { rss = $2/1024 }
        /^Name/  { name = $2 }
        END      { printf "%16s %6d MB\n", name, rss }' "$f"
done | sort -k2 -n
```
We capture the values from the `VmRSS` and `Name` lines into variables and
print them in the `END` action once all records have been processed.

### Run shell command and capture output
```bash
awk '
    /^Pid/ {
        "ps --no-header -o user " $2 | getline user
        print user
    }' /proc/1/status
```
We build a `ps` command line, capture the first line of the command's output
in the `user` variable and then print it.
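
When a command is run once per record, it can help to close it after reading
so the pipe is not left open. The following is a minimal sketch under that
assumption. The `echo ... | tr` command is only a stand-in for a real command
and `cmd` and `up` are illustrative variable names.

```bash
printf 'a\nb\n' | awk '{
    cmd = "echo " $1 " | tr a-z A-Z"   # build the command line from a field
    cmd | getline up                   # read the first line of its output
    close(cmd)                         # close the pipe before the next record
    print $1, up
}'
# a A
# b B
```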