Disclaimer: this post is about a bit of my work that I thought might be interesting. It is specific to using Rust as a code generation target, so it will not serve as a general guide to improving Rust compile times.

As part of my PhD (2 years down, 1 to go!) I’m developing a compiler, borealis, that takes a Sail instruction set architecture (ISA) description and produces a fast emulator of that architecture. For example, you put the Aarch64 model in and get out an x86_64 program that itself runs Aarch64 programs.

The plan is to first generate simple (and slow) interpreters, verify their correctness, then work on generating faster dynamic binary translation (DBT) systems.

I wouldn’t want to fix what ain’t broke, so borealis links against the Sail compiler, which is very helpfully structured as a library with pluggable components. The linking and FFI setup between Rust and OCaml is gnarly, though.

The Sail compiler produces a low-level IR called JIB; borealis takes this JIB and processes it through several IRs and compiler passes.

Control Flow

An earlier version of borealis used GenC (a C-like language used in the GenSim toolchain) as the output target. This was intended as a proof-of-concept stepping stone, allowing me to re-use the existing GenSim infrastructure for generating interpreters and DBTs.

JIB only contains unstructured control flow; JIB basic blocks contain jump statements that conditionally or unconditionally jump to another basic block. GenC contains no gotos or other unstructured control flow, only structured control flow (if and switch statements).
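To make the mismatch concrete, here is a hypothetical diamond-shaped function in both styles (the block notation is illustrative pseudocode rather than real JIB, and the structured half is written as Rust for familiarity):

    // Unstructured (JIB-style): basic blocks ending in jumps.
    //   b0: branch cond b1 b2   // conditional jump
    //   b1: x = 1; jump b3
    //   b2: x = 2; jump b3
    //   b3: return x            // the common child block
    //
    // Structured: the same logic using only an if statement, with the
    // contents of the blocks nested inside it.
    fn diamond(cond: bool) -> i32 {
        let x;
        if cond {
            x = 1; // b1
        } else {
            x = 2; // b2
        }
        x // b3, emitted exactly once after the if
    }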

So what do we do with this mismatch?

Option 1: Do Nothing

We can emit an if statement for every conditional jump and recurse with our code generator in the true and false bodies. This does generate valid code, it just generates a lot of valid code. With this approach, no code paths ever re-join, so any shared downstream logic is duplicated across both branches. This duplication then occurs at every conditional, resulting in an explosion in generated code size.
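A minimal sketch of this naive emitter, over a hypothetical simplified block representation (real JIB blocks are richer), shows where the duplication comes from:

    // Hypothetical simplified IR: a block is either a conditional branch
    // or a line of code followed by an optional unconditional jump.
    enum Block {
        Branch { cond: String, t: usize, f: usize },
        Line { body: String, next: Option<usize> },
    }

    // Recursively emit structured code for everything reachable from `idx`.
    fn emit(blocks: &[Block], idx: usize, out: &mut String) {
        match &blocks[idx] {
            Block::Branch { cond, t, f } => {
                out.push_str(&format!("if {cond} {{\n"));
                emit(blocks, *t, out); // a common child is emitted here...
                out.push_str("} else {\n");
                emit(blocks, *f, out); // ...and then again here
                out.push_str("}\n");
            }
            Block::Line { body, next } => {
                out.push_str(body);
                out.push('\n');
                if let Some(n) = next {
                    emit(blocks, *n, out);
                }
            }
        }
    }

On the diamond above, b3 is emitted in both arms; nest a few more conditionals and the duplication compounds exponentially.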

If we could find that child block (blue in the above diagrams), then we could code generate child blocks only up to that common block, close the if statement, and avoid unnecessary duplication.

Option 2: Graph Search

This problem of finding where paths share a common node is an existing graph problem. I tried adapting the lowest common ancestor algorithm, but I either had a bug or a misunderstanding, because it would not give the correct output.

Just trying to record all possible paths used 50GB+ of memory for even small functions. However, a form of breadth-first search did slightly better: all paths from the current node were stored and grown one node at a time until either a common node or a preset maximum depth was hit. This worked, but was still incredibly memory- and CPU-intensive.

Option 3: Couple’s Walk

After struggling with this for a few weeks, I spoke to one of my supervisors, who almost immediately suggested the following algorithm: send two walkers down each side of a conditional, then compare their walked paths for common blocks.

The key idea is that if a common child block exists, all that matters is that it is reachable from both sides of the current node; the particular path taken through each side is irrelevant. In my implementation the walkers always take the left target of any conditional, but you could even have them choose randomly.
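A minimal sketch of the walk, assuming a hypothetical BlockId handle and a next closure that picks a walker’s single successor (returning None at a terminator); the real implementation runs over borealis’ own control flow graph types:

    use std::collections::HashSet;

    type BlockId = usize; // hypothetical block handle

    /// Walk one path from each branch target in lockstep, returning the
    /// first block reached by both walkers, if any exists.
    fn find_common_child<F: Fn(BlockId) -> Option<BlockId>>(
        left: BlockId,
        right: BlockId,
        next: F,
    ) -> Option<BlockId> {
        let mut seen_left: HashSet<BlockId> = HashSet::new();
        let mut seen_right: HashSet<BlockId> = HashSet::new();
        let (mut l, mut r) = (Some(left), Some(right));

        while l.is_some() || r.is_some() {
            if let Some(b) = l {
                if seen_right.contains(&b) {
                    return Some(b); // the other walker already passed here
                }
                // Halt this walker at a terminator or if it loops on itself.
                l = if seen_left.insert(b) { next(b) } else { None };
            }
            if let Some(b) = r {
                if seen_left.contains(&b) {
                    return Some(b);
                }
                r = if seen_right.insert(b) { next(b) } else { None };
            }
        }
        None // the paths never re-join
    }

Each walker visits any block at most once before halting, which is where the linear bound below comes from.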

This has linear complexity in both time and space, and, more importantly, great wall-clock performance.

Rust

With the control flow issues solved, I continued with the rest of the implementation and emitted a portion of the ARMv8.5 Sail model in GenC, allowing for a simple Fibonacci Aarch64 program to be successfully emulated on an x86_64 host.

Unfortunately, as more of the model was enabled, scalability issues within the GenSim toolchain were encountered. 24+ hour compile times on each change just aren’t compatible with only 3.5 years of PhD funding.

The decision was made to power through and emit Rust instead: first an interpreter, with the goal of emitting a Rust dynamic binary translator down the road.

In Rust, there are no goto statements. But unlike GenC, Rust has loops, allowing for a straightforward initial implementation: a loop containing a match on a current-block index variable, with each match arm containing the contents of one block and jumping by assigning the target block index to that variable.

    const BLOCK_0: u32 = 0u32;
    const BLOCK_1: u32 = 1u32;
    const BLOCK_2: u32 = 2u32;
    let mut gs_24618: bool = Default::default();
    let mut gs_24617: bool = Default::default();
    let mut current_block = BLOCK_0;
    loop {
        match current_block {
            BLOCK_0 => {
                // C s_0_0: const #() : ()
                let s_0_0: () = ();
                // S s_0_1: call DebugTarget(s_0_0)
                let s_0_1: u8 = DebugTarget(state, tracer, s_0_0);
                // S s_0_2: call ELUsingAArch32(s_0_1)
                let s_0_2: bool = ELUsingAArch32(state, tracer, s_0_1);
                // S s_0_3: not s_0_2
                let s_0_3: bool = !s_0_2;
                // N s_0_4: branch s_0_3 b8 b1
                current_block = if s_0_3 { BLOCK_8 } else { BLOCK_1 };
            }
            BLOCK_1 => {
                // C s_1_0: const #0u : u1
                let s_1_0: bool = false;
                // D s_1_1: write-var gs#24617 <= s_1_0
                gs_24617 = s_1_0;
                // N s_1_2: jump b2
                current_block = BLOCK_2;
            }
            ...
        }
    }

Workspaces

Emitting all functions into a single .rs file worked well for a while, and having the whole model in a single Rust module was very convenient. However, as more of the model was enabled and we were generating over 300k lines of code, the compile times were growing into multiple hours.

Since Rust crates (not modules or files) are the smallest unit of concurrent compilation, the next step seemed to be emitting a workspace of crates. The question then is what to put in each crate. We already have a dependency graph of functions, so somewhere in the range from “all functions in a single crate” (minimises cargo overheads) to “one function per crate” (maximises concurrent compilation) there must be some optimal grouping of functions. The crate dependency graph must not contain cycles (though cycles between functions within a crate are allowed), and if each function (now a node in the graph) is given a weight based on its size, the goal is to find N regions of the graph with equal weights. This problem attracted many academics in an impromptu meeting in the coffee lounge of the Computer Science department; there are many ways to approach it in different branches of graph theory!
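For illustration only (as the next paragraph admits, I never actually needed it), here is one simple grouping that respects the constraints: topologically sort the functions with dependencies first, then cut the order into contiguous chunks of roughly equal weight. Every call edge then points into the same or an earlier chunk, so the crate graph cannot contain a cycle. All names here are hypothetical:

    type FnId = usize; // hypothetical function handle

    /// `topo` is topologically sorted, dependencies first, with each
    /// function paired with a weight (e.g. its generated lines of code).
    /// Assumes `n_crates >= 1`.
    fn partition(topo: &[(FnId, u64)], n_crates: u64) -> Vec<Vec<FnId>> {
        let total: u64 = topo.iter().map(|&(_, w)| w).sum();
        let budget = (total + n_crates - 1) / n_crates; // target weight per crate
        let mut crates: Vec<Vec<FnId>> = vec![Vec::new()];
        let mut weight = 0;
        for &(f, w) in topo {
            // Start a new crate once the current one is over budget.
            if weight + w > budget && !crates.last().unwrap().is_empty() {
                crates.push(Vec::new());
                weight = 0;
            }
            crates.last_mut().unwrap().push(f);
            weight += w;
        }
        crates
    }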

However, I feel a bit bad about not actually checking first what the overheads of thousands of crates are in cargo, because I just wrote a quick implementation placing one function in each crate and it brought the compile time down massively. There was just a few-second pause at the start of each cargo invocation, which is totally fine. One function per crate also minimises recompilation for any change: since changes are detected based on file last-modified timestamps, borealis calculates a diff between the current filesystem and the desired output, then only writes changed files.
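A minimal sketch of that diffing idea, assuming UTF-8 contents and a hypothetical helper name (the real implementation diffs the entire output tree):

    use std::{fs, io, path::Path};

    /// Write `contents` only if it differs from what is already on disk,
    /// leaving the last-modified timestamp of unchanged files untouched so
    /// that their crates are not needlessly recompiled.
    fn write_if_changed(path: &Path, contents: &str) -> io::Result<()> {
        match fs::read_to_string(path) {
            Ok(existing) if existing == contents => Ok(()), // unchanged: skip
            _ => fs::write(path, contents),
        }
    }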

What I hadn’t checked, however, was the distribution of function sizes: a few large functions were taking nearly all the compilation time. I really wasn’t sure how I could optimize the generated code for compile time, or how best to split functions into smaller units, so I made a post on r/rust for advice. I received some incredibly helpful feedback; in particular I’d like to thank matthieum and gmorenz for their suggestions.

The main advice was that the block-matching loop is unlikely to be optimized back into its underlying collection of basic blocks by the Rust compiler, and that large functions are slow for rustc and LLVM to process. If instead each block is emitted as its own function, we avoid emitting large functions and get a structure much more amenable to optimization. In particular, tail call optimization kicks in without issue at opt-level=1 and greater (I didn’t even need any attributes or flags). A struct holding all of the function’s state is created and passed through the calls between blocks.

pub fn AddWithCarry<T: Tracer>(
    state: &mut State,
    tracer: &T,
    x: Bits,
    y: Bits,
    carry_in: bool,
) -> ProductType19be375472571c5f {
    #[derive(Default)]
    struct FunctionState {
        n: bool,
        v: bool,
        signed_sum: i128,
        c: bool,
        z: bool,
        unsigned_sum: i128,
        x: Bits,
        y: Bits,
        carry_in: bool,
    }
    let fn_state = FunctionState {
        x,
        y,
        carry_in,
        ..Default::default()
    };
    return block_0(state, tracer, fn_state);
    fn block_0<T: Tracer>(
        state: &mut State,
        tracer: &T,
        mut fn_state: FunctionState,
    ) -> ProductType19be375472571c5f {
        // D s_0_0: read-var x:bv
        let s_0_0: Bits = fn_state.x;
        // D s_0_1: cast zx s_0_0 -> i
        let s_0_1: i128 = (s_0_0.value() as i128);
        // D s_0_2: read-var y:bv
        let s_0_2: Bits = fn_state.y;
        // D s_0_3: cast zx s_0_2 -> i
        let s_0_3: i128 = (s_0_2.value() as i128);
        // D s_0_4: add s_0_1 s_0_3
        let s_0_4: i128 = (s_0_1 + s_0_3);
        // D s_0_5: read-var carry_in:u8
        let s_0_5: bool = fn_state.carry_in;
        // D s_0_6: cast zx s_0_5 -> bv
        let s_0_6: Bits = Bits::new(s_0_5 as u128, 1u16);
        // D s_0_7: cast zx s_0_6 -> i
        let s_0_7: i128 = (s_0_6.value() as i128);
        // D s_0_8: cast reint s_0_7 -> i64
        let s_0_8: i64 = (s_0_7 as i64);
        // D s_0_9: cast zx s_0_8 -> i
        let s_0_9: i128 = (i128::try_from(s_0_8).unwrap());
        // D s_0_10: add s_0_4 s_0_9
        let s_0_10: i128 = (s_0_4 + s_0_9);
        // D s_0_11: write-var unsigned_sum <= s_0_10
        fn_state.unsigned_sum = s_0_10;
        ...
        // N s_0_62: branch s_0_61 b9 b1
        if s_0_61 {
            return block_9(state, tracer, fn_state);
        } else {
            return block_1(state, tracer, fn_state);
        };
    }
}

Results

Between the workspace and function-splitting changes, for the same code size, we went from a 6-hour cold compile time down to 3 minutes, and the resulting interpreter ran 4x faster. I’m really pleased with this and grateful for all the help I received.

Enabling the rest of the ARM model appears to have resulted in compile times that grow linearly with input size, which is fantastic too.

Ongoing Issues

There are some ongoing issues I’d like to solve, in no particular order.

  • I love sccache, but since I’m using a custom target JSON I can’t use it for either caching or distributing compilation; finding a way to do one or both of those would be great
  • The ./target directory grows into the hundreds of GBs within just a few hours of working on the project; this isn’t a huge deal, but it would be nice to be able to trim old artifacts without cleaning entirely
  • cargo pauses for about 10-15 seconds when running commands in projects that depend on a crate in the generated workspace. However, running cargo in the workspace itself results in a 60+ second stall before I see the usual output. I’m assuming this is all crate graph dependency resolution, but I need to investigate.