Watch your allocations

Recently, I've been working on a small project in the Zig programming language that fuzzy filters items in a list. It works by comparing the search string against each candidate, scoring how similar the two are, and then ranking the candidates by that score.

The algorithm I landed on for this was the Levenshtein distance, mostly because it seems to be the most common choice. It builds a matrix to calculate how many single-character edits (insertions, deletions, and substitutions) are needed to turn one string into the other, also known as the edit distance. For example, "kitten" → "sitting" has a distance of 3: two substitutions and one insertion.

If you haven't used Zig before, it forces you to be explicit about the allocator you're using when allocating memory, usually by passing it in as an argument. More on that in a bit.

Here's the function I made for comparing two strings and getting back the edit distance.

pub fn distance(allocator: std.mem.Allocator, s: []const u8, t: []const u8) !u8 {
    // Allocate a flat (s.len + 1) × (t.len + 1) matrix for the DP table.
    const array = try allocator.alloc(u8, (s.len + 1) * (t.len + 1));
    defer allocator.free(array);

    // Compute and return the distance…
}

Don't worry about the details, but do note that every comparison performs a fresh allocation for the matrix. Since the matrix size varies with each candidate string, we can't allocate it up front (without either guessing or doing a pass to find the maximum length first).
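If you're curious about the elided part anyway, it's the classic dynamic-programming fill. Here's a minimal sketch of what it might look like - standard Wagner-Fischer, not necessarily the exact code from the project:

const std = @import("std");

pub fn distance(allocator: std.mem.Allocator, s: []const u8, t: []const u8) !u8 {
    // Row width of the flattened (s.len + 1) × (t.len + 1) matrix.
    const w = t.len + 1;
    const array = try allocator.alloc(u8, (s.len + 1) * w);
    defer allocator.free(array);

    // First column/row: turning a prefix into the empty string costs
    // one edit per character. (u8 matches the original signature, so
    // strings longer than 255 bytes would overflow it.)
    for (0..s.len + 1) |i| array[i * w] = @intCast(i);
    for (0..w) |j| array[j] = @intCast(j);

    for (1..s.len + 1) |i| {
        for (1..w) |j| {
            const cost: u8 = if (s[i - 1] == t[j - 1]) 0 else 1;
            const del = array[(i - 1) * w + j] + 1; // delete from s
            const ins = array[i * w + (j - 1)] + 1; // insert into s
            const sub = array[(i - 1) * w + (j - 1)] + cost; // substitute
            array[i * w + j] = @min(del, @min(ins, sub));
        }
    }
    return array[s.len * w + t.len];
}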

With a search string of "harder better faster stronger" and a file of ~20k songs and artists, running this on my ThinkPad under hyperfine gives a mean time of around 110-120 ms.

❯ zig build --release=safe
❯ hyperfine -w 5 './zig-out/bin/fuzzy "harder better faster stronger" library.txt'
Benchmark 1: ./zig-out/bin/fuzzy "harder better faster stronger" library.txt
  Time (mean ± σ):     116.6 ms ±   6.7 ms    [User: 50.2 ms, System: 64.4 ms]

Not great.

Looking at the benchmark numbers, the System time (64.4 ms) is actually larger than the User time (50.2 ms): we spend more time in kernel calls than running our own logic. That strongly suggests the allocations are eating a significant portion of our execution time.

Enter the arena

When allocating the memory for the matrix, we're using the regular std.heap.page_allocator, which you'd reach for in roughly the same situations as malloc in C - except that it requests whole pages directly from the operating system on every allocation.

So, why would this hinder us? Every time we want to allocate the memory for our matrix, we first need to make a syscall to the operating system and wait for it to hand the memory back before we can proceed. That operation is not free, and since we allocate inside the comparison loop, it happens once per candidate - roughly 20,000 times here.
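You can see the cost in miniature - a sketch, where on Linux each allocation maps to an mmap syscall and each free to a munmap:

const std = @import("std");

pub fn main() !void {
    const page = std.heap.page_allocator;

    // Rounds the request up to whole pages and asks the OS for them.
    const buf = try page.alloc(u8, 100);
    // Hands the pages straight back to the OS.
    defer page.free(buf);
}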

But this is not the only allocator that we have available to us.

There is another option, std.heap.ArenaAllocator, which allocates larger chunks of memory from a backing allocator as needed and then hands out pieces of those chunks on request. Most allocations become a cheap bump within the current chunk rather than a syscall, and individual frees are no-ops - everything is released at once when the arena is torn down.
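The typical lifecycle looks something like this (a minimal sketch):

const std = @import("std");

pub fn main() !void {
    // The arena gets its chunks from a backing allocator - here the
    // page allocator - but only occasionally, when it needs to grow.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // A single deinit releases everything the arena ever handed out.
    defer arena.deinit();

    const allocator = arena.allocator();
    const buf = try allocator.alloc(u8, 1024); // usually just a pointer bump
    defer allocator.free(buf); // a no-op for arenas; memory lives until deinit
}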

Now, how much effort would it be to change to an arena allocator and see what happens? Sounds like a lot of work just to try it out…

-    const allocator = std.heap.page_allocator;
+    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
+    const allocator = arena.allocator();

     while (lines.next()) |line| {
         const distance = try fuzzy.levenshtein.distance(allocator, term, lowered);
         try priority_queue.add(distance);
     }

We only swap out the allocator used when calling our distance function - that's it.

Instead of passing the page allocator in directly, we create an arena allocator backed by it and pass that in - and we're done.
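One caveat: since we never reset the arena, its memory keeps accumulating until the program exits - fine for a short-lived CLI like this. If that were a concern, recent Zig versions let the arena keep its chunk and reuse it between iterations; a sketch:

while (lines.next()) |line| {
    // Reuse the arena's chunk for the next candidate instead of letting
    // allocations pile up. The u8 result is copied out, so nothing we
    // still need lives in the arena after this iteration.
    defer _ = arena.reset(.retain_capacity);

    const distance = try fuzzy.levenshtein.distance(allocator, term, lowered);
    try priority_queue.add(distance);
}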

Now, let's rerun our program again and see what happens.

❯ zig build --release=safe
❯ hyperfine -w 5 './zig-out/bin/fuzzy "harder better faster stronger" library.txt'
Benchmark 1: ./zig-out/bin/fuzzy "harder better faster stronger" library.txt
  Time (mean ± σ):      58.6 ms ±   8.2 ms    [User: 57.2 ms, System: 1.2 ms]

That's almost a 2x speedup - from a two-line change!

And what's really cool about this is that we didn't have to touch the distance function at all. We could've done something similar in other languages by allocating more memory up front, but that would have changed how we write the function itself.

Here, since the allocator is explicitly passed in as an injected dependency, we only rely on the allocator's interface, not its implementation. Our function doesn't care how the allocation happens, just that it does.
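Anything that satisfies std.mem.Allocator can be dropped in. A sketch with a couple of stand-ins (assuming the distance function from earlier):

const std = @import("std");

fn demo() !void {
    // Straight from the OS:
    const a1 = std.heap.page_allocator;

    // Bump-allocated from growing chunks:
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const a2 = arena.allocator();

    // Carved out of a stack buffer - no heap at all:
    var buf: [4096]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);
    const a3 = fba.allocator();

    // distance() is oblivious to which one it gets.
    _ = try distance(a1, "kitten", "sitting");
    _ = try distance(a2, "kitten", "sitting");
    _ = try distance(a3, "kitten", "sitting");
}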

The point

One of the things I like about Zig is exactly this - the explicitness. It forces you to be aware that a specific call will allocate memory, and that invites you to think about whether that's a good place to do that work.

There are probably some healthy habits most developers could benefit from - one of them is being aware of patterns that can hurt performance. Maybe not to eke out every last drop of performance, but to make sure we're not doing silly things because we didn't think them through. Being explicit like this helps you spot that.

Now, there are probably lots of other optimizations we could explore on this problem, such as making it multi-threaded or applying better filtering to do fewer overall comparisons.

But given how much speedup we got just from thinking the problem through a bit, and how easy it was to swap the allocator, this was a pretty unexpected win.