path: root/content/2024-12-20-xpost-juicebox-asm/index.md



+++
title = "xpost: x64 jit assembler (juicebox-asm)"

[taxonomies]
tags = ["x86", "rust", "linux"]
+++

This is a cross post to a repository implementing an `x64` _jit assembler_,
which I used to study the x86 instruction encoding and how to write such a tool
in rust. The code is available on github [>> juicebox-asm <<][jb].

The _assembler_ only implements a small set of the x86 instruction set, but
enough to create some interesting examples. Additionally, the repository
provides a simple _runtime_ based on [mmap(2)][mmap] (only on linux), which is
used to execute the dynamically assembled code and provided examples.

## Assembler
The _assembler_ implements all the supported instructions and provides a similar
interface as when writing assembly code by hand.

```rust
let mut asm = Asm::new();
asm.xor(rax, rax);
asm.add(rax, rdi);
asm.add(rax, Mem64::indirect_base_index(rdi, rsi));
asm.ret();

asm.disasm();
// Outputs:
//   00000000  4831C0    xor rax,rax
//   00000003  4801F8    add rax,rdi
//   00000006  48030437  add rax,[rdi+rsi]
//   00000006  C3        ret
```

An interesting design-choice is given by the implementation of the assembler API
for the various instructions. Since x86 is a _CISC_ instruction set, it allows
same the mnemonic to be used with different types and permutations of
operands. This can be seen in the code listing above where the _add_ instruction
is used with both `register-register` and `register-memory` operands. It becomes
even clearer when looking into the instruction reference manual, in the example
of the _add_ instruction.

<img src="x86-add.png">

> Table taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

Since rust does not have function overloading
<sup id="sup-1-src"><a href="#sup-1-dst">1</a></sup>
as in cpp, one does need to define functions with different function names to
implement the _add_ instruction for different combinations of operands.
```rust
// Pseudo code.
fn add_r64_r64(Reg64, Reg64)
fn add_r64_m64(Reg64, Mem64)
// ..
```

However, in this experiment, I did not go with that approach. Instead I used a
different solution and defined generic traits for the various instructions which
have different operand combinations. The _Add_ trait below gives an example of
that.

```rust
pub trait Add<T, U> {
    fn add(&mut self, op1: T, op2: U);
}
```

This in turn allows to implement the traits for all the different combinations
of operands that should be supported by each instruction. It gives fine control
of what to support and provides an ergonomic API as seen in the first example
code listing.

```rust
impl Add<Reg64, Reg64> for Asm {
    fn add(&mut self, op1: Reg64, op2: Reg64) {
        self.encode_rr(&[0x01], op1, op2);
    }
}

impl Add<Reg64, Mem64> for Asm {
    fn add(&mut self, op1: Reg64, op2: Mem64) {
        self.encode_rm(0x03, op1, op2);
    }
}
```

All instructions currently implemented with their variations can be found in the
documentation of the [Asm][jb-asm] _assembler_.

## Runtime
With the help of the provided runtime, code assembled with the _assembler_ can
be put into executable memory and turned into a function pointer, allowing to
call into the generated code.

```rust
extern "C" fn add(a: u32, b: u32) -> u32 {
    a + b
}

fn main() {
    // SystemV abi:
    //   rdi -> first argument
    //   rsi -> second argument
    //   rax -> return value

    let mut asm = Asm::new();
    asm.mov(rsi, Imm64::from(42));
    asm.mov(rax, Imm64::from(add as usize));
    asm.call(rax);
    asm.ret();

    let mut rt = Runtime::new();
    let add42 = unsafe { rt.add_code::<extern "C" fn(u32) -> u32>(asm.into_code()) };

    let res = add42(5);
    assert_eq!(res, 47);
}
```
> This example must be run on a system with the [SystemV][sysv] ABI.

## Label
A label is used to associate a location in the generated code with a symbolic
name (the label), which can then be used as branch target for control-flow
instructions as for example `jz`, `jnz` and so on.

The act of associating a label with a location is called _binding_. Each label
can only be bound **once**, else the branch target would become _ambiguous_.

Binding and jumping to a label can come in two orders.
1. A label is first bound and then jumped to (backwards-reference).
1. A label is first jumped to and then bound (forwards-reference).

In the following both cases will be discussed in a bit more detail, which
hopefully helps when browsing the implementation. One thing to keep in mind,
since the generated code must be relocatable, all jump instructions will be
encoded relative to the current program counter (pc).

The first case where the label is first bound and later used, is somewhat more
natural to think about, as first defining then using something is pretty
clear. The code listing below shows an example of a loop with a back-reference
summing up all integer between 0 and 42.

```rust,linenos
let mut asm = Asm::new();
let mut lp  = Label::new();

asm.mov(eax, Imm32::from(0));   // int ret = 0;
asm.mov(ecx, Imm32::from(42));  // int n   = 42;

asm.bind(&mut lp);              // lp:
asm.add(eax, ecx);              //   ret += n;
asm.dec(ecx);                   //   --n;
asm.test(ecx, ecx);             //   if (n != 0)
asm.jnz(&mut lp);               //     goto lp

asm.ret();
```

From the point of view of the code buffer this looks as follows. When the label
is bound (line 7), the current offset into the code buffer is recorded as the
labels _location_. Then when emitting the jump instruction (line 11) later, this
location is used with the current code buffers offset to compute the
_displacement_ (disp32) forming the relative jump.

<div style="margin:auto; width:85%;">
<img src="label-lj.svg">
</div>

The second case where the label is first used and then bound later, is not
really more complex, but one has to reverse the steps. The code listing below
gives an example, demonstrating forwards jumps by validating two input
arguments to be not equal to 0 and returning of one check fails.

```rust,linenos
let mut asm = Asm::new();
let mut ret = Label::new();

// Validate inputs (sysv abi).
asm.test(rdi, rdi);     // if (rdi == 0)
asm.jz(&mut ret);       //   return;
asm.test(rsi, rsi);     // if (rsi == 0)
asm.jz(&mut ret);       //   return;

// Here, rdi != 0 && rsi != 0.
// nop as placeholder for some real work.
asm.nop();

asm.bind(&mut ret);
asm.ret();
```

Again from the point of view of the code buffer this looks as follows. When the
jump instructions are emitted (line 6, 8), the _displacement_ can not be
computed, as the label has no defined location yet. A placeholder displacement
of value 0 is emitted into the code buffer and the offset to the displacement
value is recorded by the label as pending _relocation_. Once the label is bound
to a location, all pending relocations are resolved by consuming the recorded
offsets, computing the corresponding displacements and patching the values in
the code buffer.

<div style="margin:auto; width:85%;">
<img src="label-jl.svg">
</div>

## Example: bf

The [bf.rs][jb-bf] example implements a _jit compiler_ for the [brainfuck][bf]
language. In this example the _bf program_ is jit compiled to one code blob as a
whole and evaluated then after. The example only returns from the jit code if
the bf program terminates or a runtime error is detected. For reference an
_interpreter_ is also provided.

## Example: tiny-vm

The [tiny_vm.rs][jb-tiny] example defines a _tiny virtual machine (TinyVm)_ with
a set of _registers_ and custom _instructions_. It then goes on and implements a
_jit compiler_ for the TinyVm. In comparison to the jit compiler in the previous
bf example, the TinyVm compiler jit compiles the program per basic-block (bb),
each time a new bb is encountered. \
After executing a jit compiled basic block, the code returns into the
TinyVm. The vm then checks if the branch target has an entry in the _jit cache_,
which maps guest addresses to compiled basic-blocks. If the cache returns a hit
the TinyVm jumps to the generated code, if the cache returns a miss the new bb is
compiled and inserted into the jit cache.

Similar to the bf example, this example also provides an _interpreter_ as
reference.

## Appendix: x86 instruction encoding
x86 instructions are encoded with _variable length_ and follow the format below.
A single instruction can be as small as 1 byte up to a double-digit number of
bytes.

<img src="x86-insn-fmt.png">

> Figure taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

_Prefixes_ are typically modifiers for the instruction, where some examples are
`LOCK` for bus locking, `REP` for instruction repetition and `REX` for
specifying extended operand sizes and to access extended registers (like
r8-r15).

The `ModR/M` and `SIB` bytes are used to encode the register and memory operands
of the instruction in the general case. Whether the `SIB` byte is used depends
on the _addressing mode_.

In case the instruction also requires an _address displacement_ or an _immediate
value_, the bytes of the value directly follow the instruction bytes.

Historically, the x86 ISA only had 8 general-purpose registers (ax, cx, dx, bx,
sp, bp, si, di). This is still visible today in various fields which encode
register operands, where one such example is the **ModRM.Reg** field. Later
on, 8 additional general-purpose registers (r8-r15) were added and one
additional bit was needed to encode all 16 registers, while keeping backwards
compatibility. \
As mentioned above, the **REX** prefix is used to access the _extended_
registers. This prefix contains the **R, X, B** bits, which are combined with
the various register operand fields, as shown by the figures below for the case
of the ModRM byte and the ModRM + SIB bytes.

<div style="margin:auto; width:85%;">
<img src="x86-insn-modrm.png">
<img src="x86-insn-modrm-sib.png">
</div>

> Figures taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

The following code listing provides an example for two load instructions with
different _addressing modes_ and gives a detailed breakdown of the instruction
bytes for each instruction.

```rust
let mut asm = Asm::new();
asm.mov(r8, Mem64::indirect_disp(r9, 0xaabb));
asm.mov(r8, Mem64::indirect_base_index(r9, r10));

asm.disasm();
// Outputs:
//   4D8B81BBAA0000    mov r8,[r9+0xaabb]
//   4F8B0411          mov r8,[r9+r10]


// Breakdown of the indirect + displacement load.
//
//   4D 8B 81 BBAA0000    mov r8,[r9+0xaabb]
//   ^  ^  ^  `Imm32
//   |  |  `ModRM: mod=10,reg=000,rm=001
//   |  `Opc: mov
//   `REX: W=1,R=1,X=0,B=1
//
//   REX.R + ModRM.reg = 1000 (r8) - dest reg
//   REX.B + ModRm.rm  = 1001 (r9) - base reg


// Breakdown of the base + index load.
//
//   4F 8B 04 11          mov r8,[r9+r10]
//   ^  ^  ^  `SIB: scale=00,index=010,base=001
//   |  |  `ModRM: mod=00,reg=000,rm=100
//   |  `Opc: mov
//   `REX: W=1,R=1,X=1,B=1
//
//  ModRM.rm = 100 -> use SIB byte
//
//   REX.R + ModRM.reg = 1000 ( r8) - dest reg
//   REX.X + SIB.index = 1010 (r10) - index reg
//   REX.B + SIB.base  = 1001 ( r9) - base reg
```

## References
- [juicebox-asm][jb]
- [osdev: x86 instruction encoding](https://wiki.osdev.org/X86-64_Instruction_Encoding)
- [intel instruction set reference][intel-im]

## Footnotes
<span id="sup-1-dst">**1)**</span>
In cpp functions with the _same_ name but _different_ arguments can be
defined. During compilation [overload resolution][cpp-or] takes place and picks
the "right" function to call on each call site.
<a href="#sup-1-src">&#x21A9;</a>

[jb]: https://github.com/johannst/juicebox-asm
[jb-asm]: https://johannst.github.io/juicebox-asm/juicebox_asm/struct.Asm.html
[jb-bf]: https://github.com/johannst/juicebox-asm/blob/main/examples/bf.rs
[jb-tiny]: https://github.com/johannst/juicebox-asm/blob/main/examples/tiny_vm.rs
[mmap]: https://www.man7.org/linux/man-pages/man2/mmap.2.html
[intel-im]: https://cdrdv2.intel.com/v1/dl/getContent/671110
[cpp-or]: https://en.cppreference.com/w/cpp/language/overload_resolution
[sysv]: https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build
[bf]: https://brainfuck.org/brainfuck.html
+++
title = "xpost: x64 jit assembler (juicebox-asm)"

[taxonomies]
tags = ["x86", "rust", "linux"]
+++

This is a cross post to a repository implementing an `x64` _jit assembler_,
which I used to study the x86 instruction encoding and how to write such a tool
in rust. The code is available on github [>> juicebox-asm <<][jb].

The _assembler_ only implements a small set of the x86 instruction set, but
enough to create some interesting examples. Additionally, the repository
provides a simple _runtime_ based on [mmap(2)][mmap] (only on linux), which is
used to execute the dynamically assembled code and provided examples.

## Assembler
The _assembler_ implements all the supported instructions and provides a similar
interface as when writing assembly code by hand.

```rust
let mut asm = Asm::new();
asm.xor(rax, rax);
asm.add(rax, rdi);
asm.add(rax, Mem64::indirect_base_index(rdi, rsi));
asm.ret();

asm.disasm();
// Outputs:
//   00000000  4831C0    xor rax,rax
//   00000003  4801F8    add rax,rdi
//   00000006  48030437  add rax,[rdi+rsi]
//   00000006  C3        ret
```

An interesting design-choice is given by the implementation of the assembler API
for the various instructions. Since x86 is a _CISC_ instruction set, it allows
same the mnemonic to be used with different types and permutations of
operands. This can be seen in the code listing above where the _add_ instruction
is used with both `register-register` and `register-memory` operands. It becomes
even clearer when looking into the instruction reference manual, in the example
of the _add_ instruction.

<img src="x86-add.png">

> Table taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

Since rust does not have function overloading
<sup id="sup-1-src"><a href="#sup-1-dst">1</a></sup>
as in cpp, one does need to define functions with different function names to
implement the _add_ instruction for different combinations of operands.
```rust
// Pseudo code.
fn add_r64_r64(Reg64, Reg64)
fn add_r64_m64(Reg64, Mem64)
// ..
```

However, in this experiment, I did not go with that approach. Instead I used a
different solution and defined generic traits for the various instructions which
have different operand combinations. The _Add_ trait below gives an example of
that.

```rust
pub trait Add<T, U> {
    fn add(&mut self, op1: T, op2: U);
}
```

This in turn allows to implement the traits for all the different combinations
of operands that should be supported by each instruction. It gives fine control
of what to support and provides an ergonomic API as seen in the first example
code listing.

```rust
impl Add<Reg64, Reg64> for Asm {
    fn add(&mut self, op1: Reg64, op2: Reg64) {
        self.encode_rr(&[0x01], op1, op2);
    }
}

impl Add<Reg64, Mem64> for Asm {
    fn add(&mut self, op1: Reg64, op2: Mem64) {
        self.encode_rm(0x03, op1, op2);
    }
}
```

All instructions currently implemented with their variations can be found in the
documentation of the [Asm][jb-asm] _assembler_.

## Runtime
With the help of the provided runtime, code assembled with the _assembler_ can
be put into executable memory and turned into a function pointer, allowing to
call into the generated code.

```rust
extern "C" fn add(a: u32, b: u32) -> u32 {
    a + b
}

fn main() {
    // SystemV abi:
    //   rdi -> first argument
    //   rsi -> second argument
    //   rax -> return value

    let mut asm = Asm::new();
    asm.mov(rsi, Imm64::from(42));
    asm.mov(rax, Imm64::from(add as usize));
    asm.call(rax);
    asm.ret();

    let mut rt = Runtime::new();
    let add42 = unsafe { rt.add_code::<extern "C" fn(u32) -> u32>(asm.into_code()) };

    let res = add42(5);
    assert_eq!(res, 47);
}
```
> This example must be run on a system with the [SystemV][sysv] ABI.

## Label
A label is used to associate a location in the generated code with a symbolic
name (the label), which can then be used as branch target for control-flow
instructions as for example `jz`, `jnz` and so on.

The act of associating a label with a location is called _binding_. Each label
can only be bound **once**, else the branch target would become _ambiguous_.

Binding and jumping to a label can come in two orders.
1. A label is first bound and then jumped to (backwards-reference).
1. A label is first jumped to and then bound (forwards-reference).

In the following both cases will be discussed in a bit more detail, which
hopefully helps when browsing the implementation. One thing to keep in mind,
since the generated code must be relocatable, all jump instructions will be
encoded relative to the current program counter (pc).

The first case where the label is first bound and later used, is somewhat more
natural to think about, as first defining then using something is pretty
clear. The code listing below shows an example of a loop with a back-reference
summing up all integer between 0 and 42.

```rust,linenos
let mut asm = Asm::new();
let mut lp  = Label::new();

asm.mov(eax, Imm32::from(0));   // int ret = 0;
asm.mov(ecx, Imm32::from(42));  // int n   = 42;

asm.bind(&mut lp);              // lp:
asm.add(eax, ecx);              //   ret += n;
asm.dec(ecx);                   //   --n;
asm.test(ecx, ecx);             //   if (n != 0)
asm.jnz(&mut lp);               //     goto lp

asm.ret();
```

From the point of view of the code buffer this looks as follows. When the label
is bound (line 7), the current offset into the code buffer is recorded as the
labels _location_. Then when emitting the jump instruction (line 11) later, this
location is used with the current code buffers offset to compute the
_displacement_ (disp32) forming the relative jump.

<div style="margin:auto; width:85%;">
<img src="label-lj.svg">
</div>

The second case where the label is first used and then bound later, is not
really more complex, but one has to reverse the steps. The code listing below
gives an example, demonstrating forwards jumps by validating two input
arguments to be not equal to 0 and returning of one check fails.

```rust,linenos
let mut asm = Asm::new();
let mut ret = Label::new();

// Validate inputs (sysv abi).
asm.test(rdi, rdi);     // if (rdi == 0)
asm.jz(&mut ret);       //   return;
asm.test(rsi, rsi);     // if (rsi == 0)
asm.jz(&mut ret);       //   return;

// Here, rdi != 0 && rsi != 0.
// nop as placeholder for some real work.
asm.nop();

asm.bind(&mut ret);
asm.ret();
```

Again from the point of view of the code buffer this looks as follows. When the
jump instructions are emitted (line 6, 8), the _displacement_ can not be
computed, as the label has no defined location yet. A placeholder displacement
of value 0 is emitted into the code buffer and the offset to the displacement
value is recorded by the label as pending _relocation_. Once the label is bound
to a location, all pending relocations are resolved by consuming the recorded
offsets, computing the corresponding displacements and patching the values in
the code buffer.

<div style="margin:auto; width:85%;">
<img src="label-jl.svg">
</div>

## Example: bf

The [bf.rs][jb-bf] example implements a _jit compiler_ for the [brainfuck][bf]
language. In this example the _bf program_ is jit compiled to one code blob as a
whole and evaluated then after. The example only returns from the jit code if
the bf program terminates or a runtime error is detected. For reference an
_interpreter_ is also provided.

## Example: tiny-vm

The [tiny_vm.rs][jb-tiny] example defines a _tiny virtual machine (TinyVm)_ with
a set of _registers_ and custom _instructions_. It then goes on and implements a
_jit compiler_ for the TinyVm. In comparison to the jit compiler in the previous
bf example, the TinyVm compiler jit compiles the program per basic-block (bb),
each time a new bb is encountered. \
After executing a jit compiled basic block, the code returns into the
TinyVm. The vm then checks if the branch target has an entry in the _jit cache_,
which maps guest addresses to compiled basic-blocks. If the cache returns a hit
the TinyVm jumps to the generated code, if the cache returns a miss the new bb is
compiled and inserted into the jit cache.

Similar to the bf example, this example also provides an _interpreter_ as
reference.

## Appendix: x86 instruction encoding
x86 instructions are encoded with _variable length_ and follow the format below.
A single instruction can be as small as 1 byte up to a double-digit number of
bytes.

<img src="x86-insn-fmt.png">

> Figure taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

_Prefixes_ are typically modifiers for the instruction, where some examples are
`LOCK` for bus locking, `REP` for instruction repetition and `REX` for
specifying extended operand sizes and to access extended registers (like
r8-r15).

The `ModR/M` and `SIB` bytes are used to encode the register and memory operands
of the instruction in the general case. Whether the `SIB` byte is used depends
on the _addressing mode_.

In case the instruction also requires an _address displacement_ or an _immediate
value_, the bytes of the value directly follow the instruction bytes.

Historically, the x86 ISA only had 8 general-purpose registers (ax, cx, dx, bx,
sp, bp, si, di). This is still visible today in various fields which encode
register operands, where one such example is the **ModRM.Reg** field. Later
on, 8 additional general-purpose registers (r8-r15) were added and one
additional bit was needed to encode all 16 registers, while keeping backwards
compatibility. \
As mentioned above, the **REX** prefix is used to access the _extended_
registers. This prefix contains the **R, X, B** bits, which are combined with
the various register operand fields, as shown by the figures below for the case
of the ModRM byte and the ModRM + SIB bytes.

<div style="margin:auto; width:85%;">
<img src="x86-insn-modrm.png">
<img src="x86-insn-modrm-sib.png">
</div>

> Figures taken from [Intel® 64 and IA-32 Architectures Software Developer’s
> Manual - Volume 2 Instruction Set Reference][intel-im].

The following code listing provides an example for two load instructions with
different _addressing modes_ and gives a detailed breakdown of the instruction
bytes for each instruction.

```rust
let mut asm = Asm::new();
asm.mov(r8, Mem64::indirect_disp(r9, 0xaabb));
asm.mov(r8, Mem64::indirect_base_index(r9, r10));

asm.disasm();
// Outputs:
//   4D8B81BBAA0000    mov r8,[r9+0xaabb]
//   4F8B0411          mov r8,[r9+r10]


// Breakdown of the indirect + displacement load.
//
//   4D 8B 81 BBAA0000    mov r8,[r9+0xaabb]
//   ^  ^  ^  `Imm32
//   |  |  `ModRM: mod=10,reg=000,rm=001
//   |  `Opc: mov
//   `REX: W=1,R=1,X=0,B=1
//
//   REX.R + ModRM.reg = 1000 (r8) - dest reg
//   REX.B + ModRm.rm  = 1001 (r9) - base reg


// Breakdown of the base + index load.
//
//   4F 8B 04 11          mov r8,[r9+r10]
//   ^  ^  ^  `SIB: scale=00,index=010,base=001
//   |  |  `ModRM: mod=00,reg=000,rm=100
//   |  `Opc: mov
//   `REX: W=1,R=1,X=1,B=1
//
//  ModRM.rm = 100 -> use SIB byte
//
//   REX.R + ModRM.reg = 1000 ( r8) - dest reg
//   REX.X + SIB.index = 1010 (r10) - index reg
//   REX.B + SIB.base  = 1001 ( r9) - base reg
```

## References
- [juicebox-asm][jb]
- [osdev: x86 instruction encoding](https://wiki.osdev.org/X86-64_Instruction_Encoding)
- [intel instruction set reference][intel-im]

## Footnotes
<span id="sup-1-dst">**1)**</span>
In cpp functions with the _same_ name but _different_ arguments can be
defined. During compilation [overload resolution][cpp-or] takes place and picks
the "right" function to call on each call site.
<a href="#sup-1-src">&#x21A9;</a>

[jb]: https://github.com/johannst/juicebox-asm
[jb-asm]: https://johannst.github.io/juicebox-asm/juicebox_asm/struct.Asm.html
[jb-bf]: https://github.com/johannst/juicebox-asm/blob/main/examples/bf.rs
[jb-tiny]: https://github.com/johannst/juicebox-asm/blob/main/examples/tiny_vm.rs
[mmap]: https://www.man7.org/linux/man-pages/man2/mmap.2.html
[intel-im]: https://cdrdv2.intel.com/v1/dl/getContent/671110
[cpp-or]: https://en.cppreference.com/w/cpp/language/overload_resolution
[sysv]: https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build
[bf]: https://brainfuck.org/brainfuck.html