content/2024-12-20-xpost-juicebox-asm/index.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352

+++
title = "xpost: x64 jit assembler (juicebox-asm)"

[taxonomies]
tags = ["x86", "rust", "linux"]
+++

This is a cross post to a repository implementing an `x64` _jit assembler_,
which I used to study the x86 instruction encoding and how to write such a tool
in rust. The code is available on github [>> juicebox-asm <<][jb].

The _assembler_ only implements a small set of the x86 instruction set, but
enough to create some interesting examples. Additionally, the repository
provides a simple _runtime_ based on [mmap(2)][mmap] (linux only), which is
used to execute the dynamically assembled code and the provided examples.

## Assembler
The _assembler_ implements all the supported instructions and provides a similar
interface as when writing assembly code by hand.

```rust
let mut asm = Asm::new();
asm.xor(rax, rax);
asm.add(rax, rdi);
asm.add(rax, Mem64::indirect_base_index(rdi, rsi));
asm.ret();

asm.disasm();
// Outputs:
//   00000000  4831C0    xor rax,rax
//   00000003  4801F8    add rax,rdi
//   00000006  48030437  add rax,[rdi+rsi]
//   00000006  C3        ret
```

An interesting design-choice is given by the implementation of the assembler
API for the various instructions, since x86 overloads instruction mnemonics
with many operand types. This can be seen in the code listing above where the
_add_ instruction is used with both `register-register` and `register-memory`
operands. It becomes even clearer when looking into the instruction reference
manual, in the example of the _add_ instruction.

<img src="x86-add.png">

> Table taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

One approach is to implement the functions with _explicit_ types for the
operands. This has the benefit that the API clearly expresses which operands
are allowed for a given instruction and the compiler can check if wrong
operands are provided. However, since rust does not have function overloading
<sup id="sup-1-src"><a href="#sup-1-dst">1</a></sup>
as in cpp, one does need to define functions with different names to
implement the instructions. The following shows an example for the _add_
instruction with different operands.
```rust
// Pseudo code.
fn add_r64_r64(Reg64, Reg64)
fn add_r64_m64(Reg64, Mem64)
// ..
```

Another approach is to provided a _flexible_ `Operand` type and implement the
instructions with that. The benefit here is that there is only **one** _add_
function. The downside however, is that the API does not clearly express which
operands are allowed and the operand check must be done at _runtime_ rather
than _compile-time_.
```rust
// Pseudo code.
fn add(Operand, Operand)
```

During this experiment I have learned that there is a way in rust to emulate
sort of function overloading, which allows to get all the benefits of the above
discussed approaches while getting rid of all the downsides at the same time at
the expense of a little boiler-plate code.

The idea is to define generic traits for the various instructions with
overloaded mnemonics. The listing below shows the `Add` trait as example for
the _add_ instruction.

```rust
pub trait Add<T, U> {
    fn add(&mut self, op1: T, op2: U);
}
```

This in turn allows to implement the traits for all the different combinations
of (explicit) operands that should be supported by each instruction. It gives
fine control of what to support and provides an ergonomic API as seen in the
first example code listing.

```rust
impl Add<Reg64, Reg64> for Asm {
    fn add(&mut self, op1: Reg64, op2: Reg64) {
        self.encode_rr(&[0x01], op1, op2);
    }
}

impl Add<Reg64, Mem64> for Asm {
    fn add(&mut self, op1: Reg64, op2: Mem64) {
        self.encode_rm(0x03, op1, op2);
    }
}
```

All instructions currently implemented with their variations can be found in the
documentation of the [Asm][jb-asm] _assembler_.

## Runtime
With the help of the provided runtime, code assembled with the _assembler_ can
be put into executable memory and turned into a function pointer, allowing to
call into the generated code.

```rust
extern "C" fn add(a: u32, b: u32) -> u32 {
    a + b
}

fn main() {
    // SystemV abi:
    //   rdi -> first argument
    //   rsi -> second argument
    //   rax -> return value

    let mut asm = Asm::new();
    asm.mov(rsi, Imm64::from(42));
    asm.mov(rax, Imm64::from(add as usize));
    asm.call(rax);
    asm.ret();

    let mut rt = Runtime::new();
    let add42 = unsafe { rt.add_code::<extern "C" fn(u32) -> u32>(asm.into_code()) };

    let res = add42(5);
    assert_eq!(res, 47);
}
```
> This example must be run on a system with the [SystemV][sysv] ABI.

## Label
A label is used to associate a location in the generated code with a symbolic
name (the label), which can then be used as branch target for control-flow
instructions as for example `jz`, `jnz`.

The act of associating a label with a location is called _binding_. Each label
can only be bound **once**, else the branch target would become _ambiguous_.

Binding and jumping to a label can come in two orders.
1. A label is first bound and then jumped to (backwards-reference).
1. A label is first jumped to and then bound (forwards-reference).

In the following both cases will be discussed in a bit more detail, which
hopefully helps when browsing the implementation. One thing to keep in mind,
since the generated code must be relocatable, all jump instructions will be
encoded relative to the current program counter (pc).

The first case where the label is first bound and later used, is somewhat more
natural to think about, as first defining then using something is pretty
clear. The code listing below shows an example of a loop with a back-reference
summing up all integer between 0 and 42.

```rust,linenos
let mut asm = Asm::new();
let mut lp  = Label::new();

asm.mov(eax, Imm32::from(0));   // int ret = 0;
asm.mov(ecx, Imm32::from(42));  // int n   = 42;

asm.bind(&mut lp);              // lp:
asm.add(eax, ecx);              //   ret += n;
asm.dec(ecx);                   //   --n;
asm.test(ecx, ecx);             //   if (n != 0)
asm.jnz(&mut lp);               //     goto lp

asm.ret();
```

From the point of view of the code buffer this looks as follows. When the label
is bound (line 7), the current offset into the code buffer is recorded as the
labels _location_. Then when emitting the jump instruction (line 11) later, this
location is used with the current code buffers offset to compute the
_displacement_ (disp32) forming the relative jump.

<div style="margin:auto; width:85%;">
<img src="label-lj.svg">
</div>

The second case where the label is first used and then bound later, is not
really more complex, but one has to reverse the steps. The code listing below
gives an example, demonstrating forwards jumps by validating two input
arguments to be not equal to 0 and returning of one check fails.

```rust,linenos
let mut asm = Asm::new();
let mut ret = Label::new();

// Validate inputs (sysv abi).
asm.test(rdi, rdi);     // if (rdi == 0)
asm.jz(&mut ret);       //   return;
asm.test(rsi, rsi);     // if (rsi == 0)
asm.jz(&mut ret);       //   return;

// Here, rdi != 0 && rsi != 0.
// nop as placeholder for some real work.
asm.nop();

asm.bind(&mut ret);
asm.ret();
```

Again from the point of view of the code buffer this looks as follows. When the
jump instructions are emitted (line 6, 8), the _displacement_ can not be
computed, as the label has no defined location yet. A placeholder displacement
of value 0 is emitted into the code buffer and the offset to the displacement
value is recorded by the label as pending _relocation_. Once the label is bound
to a location, all pending relocations are resolved by consuming the recorded
offsets, computing the corresponding displacements and patching the values in
the code buffer.

<div style="margin:auto; width:85%;">
<img src="label-jl.svg">
</div>

## Example: bf

The [bf.rs][jb-bf] example implements a _jit compiler_ for the [brainfuck][bf]
language. In this example the _bf program_ is jit compiled to one code blob as a
whole and evaluated then after. The example only returns from the jit code if
the bf program terminates or a runtime error is detected. For reference an
_interpreter_ is also provided.

## Example: tiny-vm

The [tiny_vm.rs][jb-tiny] example defines a _tiny virtual machine (TinyVm)_ with
a set of _registers_ and custom _instructions_. It then goes on and implements a
_jit compiler_ for the TinyVm. In comparison to the jit compiler in the previous
bf example, the TinyVm compiler jit compiles the program per basic-block (bb),
each time a new bb is encountered. \
After executing a jit compiled basic block, the code returns into the
TinyVm. The vm then checks if the branch target has an entry in the _jit cache_,
which maps guest addresses to compiled basic-blocks. If the cache returns a hit
the TinyVm jumps to the generated code, if the cache returns a miss the new bb is
compiled and inserted into the jit cache.

Similar to the bf example, this example also provides an _interpreter_ as
reference.

## Appendix: x86 instruction encoding
x86 instructions are encoded with _variable length_ and follow the format below.
A single instruction can be as small as 1 byte up to a double-digit number of
bytes.

<img src="x86-insn-fmt.png">

> Figure taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

_Prefixes_ are typically modifiers for the instruction, where some examples are
`LOCK` for bus locking, `REP` for instruction repetition and `REX` for
specifying extended operand sizes and to access extended registers (like
r8-r15).

The `ModR/M` and `SIB` bytes are used to encode the register and memory operands
of the instruction in the general case. Whether the `SIB` byte is used depends
on the _addressing mode_.

In case the instruction also requires an _address displacement_ or an _immediate
value_, the bytes of the value directly follow the instruction bytes.

Historically, the x86 ISA only had 8 general-purpose registers (ax, cx, dx, bx,
sp, bp, si, di). This is still visible today in various fields which encode
register operands, where one such example is the **ModRM.Reg** field. Later
on, 8 additional general-purpose registers (r8-r15) were added and one
additional bit was needed to encode all 16 registers, while keeping backwards
compatibility. \
As mentioned above, the **REX** prefix is used to access the _extended_
registers. This prefix contains the **R, X, B** bits, which are combined with
the various register operand fields, as shown by the figures below for the case
of the ModRM byte and the ModRM + SIB bytes.

<div style="margin:auto; width:85%;">
<img src="x86-insn-modrm.png">
<img src="x86-insn-modrm-sib.png">
</div>

> Figures taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

The following code listing provides an example for two load instructions with
different _addressing modes_ and gives a detailed breakdown of the instruction
bytes for each instruction.

```rust
let mut asm = Asm::new();
asm.mov(r8, Mem64::indirect_disp(r9, 0xaabb));
asm.mov(r8, Mem64::indirect_base_index(r9, r10));

asm.disasm();
// Outputs:
//   4D8B81BBAA0000    mov r8,[r9+0xaabb]
//   4F8B0411          mov r8,[r9+r10]


// Breakdown of the indirect + displacement load.
//
//   4D 8B 81 BBAA0000    mov r8,[r9+0xaabb]
//   ^  ^  ^  `Imm32
//   |  |  `ModRM: mod=10,reg=000,rm=001
//   |  `Opc: mov
//   `REX: W=1,R=1,X=0,B=1
//
//   REX.R + ModRM.reg = 1000 (r8) - dest reg
//   REX.B + ModRm.rm  = 1001 (r9) - base reg


// Breakdown of the base + index load.
//
//   4F 8B 04 11          mov r8,[r9+r10]
//   ^  ^  ^  `SIB: scale=00,index=010,base=001
//   |  |  `ModRM: mod=00,reg=000,rm=100
//   |  `Opc: mov
//   `REX: W=1,R=1,X=1,B=1
//
//  ModRM.rm = 100 -> use SIB byte
//
//   REX.R + ModRM.reg = 1000 ( r8) - dest reg
//   REX.X + SIB.index = 1010 (r10) - index reg
//   REX.B + SIB.base  = 1001 ( r9) - base reg
```

## References
- [juicebox-asm][jb]
- [osdev: x86 instruction encoding](https://wiki.osdev.org/X86-64_Instruction_Encoding)
- [intel instruction set reference][intel-im]

## Footnotes
<span id="sup-1-dst">**1)**</span>
In cpp functions with the _same_ name but _different_ arguments can be
defined. During compilation [overload resolution][cpp-or] takes place and picks
the "right" function to call on each call site.
<a href="#sup-1-src">&#x21A9;</a>

[jb]: https://github.com/johannst/juicebox-asm
[jb-asm]: https://johannst.github.io/juicebox-asm/juicebox_asm/struct.Asm.html
[jb-bf]: https://github.com/johannst/juicebox-asm/blob/main/examples/bf.rs
[jb-tiny]: https://github.com/johannst/juicebox-asm/blob/main/examples/tiny_vm.rs
[mmap]: https://www.man7.org/linux/man-pages/man2/mmap.2.html
[intel-im]: https://cdrdv2.intel.com/v1/dl/getContent/671110
[cpp-or]: https://en.cppreference.com/w/cpp/language/overload_resolution
[sysv]: https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build
[bf]: https://brainfuck.org/brainfuck.html