Hitting Peak File IO Performance with Zig

Ozgur Akkurt | 2025-09-04

Intro

This post goes through how to maximize file IO performance on Linux using Zig with io_uring.

All code related to this post can be found in this repo.

a) Benchmark

We are comparing fio and the Zig code, which can be found here.

test system

We are using a machine with:

  • Ubuntu 24.04 (6.14 HWE kernel).
  • kernel parameter nvme.poll_queues=16 (see the note after this list).
  • "datacenter" NVMe SSD without any RAID.
  • 756 GB of RAM. The amount of RAM should be irrelevant for this test: we are using direct IO, so the file is not cached.
  • Ryzen EPYC CPU with 32 cores (64 threads). The CPU should be mostly irrelevant for this test since we are using only one thread.
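
For reference, nvme.poll_queues is a parameter of the nvme kernel module. One way to set it (assuming a GRUB based setup) is to add it to the kernel command line, reboot, and verify it through sysfs:

# /etc/default/grub (append to the existing options), then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="nvme.poll_queues=16"

# verify after reboot
cat /sys/module/nvme/parameters/poll_queues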

fio config

[global]
direct=1
bs=512K
rw=write
ioengine=io_uring
iodepth=64
size=16g
hipri=1
fixedbufs=1
registerfiles=0
sqthread_poll=1
numjobs=1

[job0]
filename=testfile

Changing rw=write to rw=read gives the read benchmark.

zig code config

The example does 64 sequential 512 KB reads at a time, same as the fio configuration. It also writes/reads the entire 16 GB file, same as fio.

The underlying library uses the exact same features as configured in the fio config as well.

The Zig code also uses a single thread, same as fio.
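
For reference, the core of such a benchmark loop, keeping 64 sequential 512 KB reads in flight, looks roughly like the sketch below. It is written directly against Zig's std.os.linux.IoUring rather than the library's API, assumes a recent Zig std, and leaves out the flag/registration setup covered in the implementation section:

const std = @import("std");
const linux = std.os.linux;

// Keeps `depth` sequential 512 KiB reads in flight until `file_size` bytes
// have been read. Assumes `fd` was opened with O_DIRECT, that `file_size` is
// a multiple of `chunk_size` (true for the 16 GB test file), and that `ring`
// was created with at least `depth` submission entries.
fn readFile(ring: *linux.IoUring, fd: std.posix.fd_t, file_size: u64) !void {
    const chunk_size: usize = 512 * 1024;
    const depth = 64;

    // page_allocator memory is page aligned, which satisfies direct IO.
    const gpa = std.heap.page_allocator;
    var buffers: [depth][]u8 = undefined;
    for (&buffers) |*buf| buf.* = try gpa.alloc(u8, chunk_size);
    defer {
        for (buffers) |buf| gpa.free(buf);
    }

    // Free-list of buffer slots; each SQE's user_data is its slot index.
    var free_slots: [depth]usize = undefined;
    for (&free_slots, 0..) |*s, i| s.* = i;
    var free_count: usize = depth;

    var next_offset: u64 = 0;
    var completed: u64 = 0;
    const total_chunks = file_size / chunk_size;

    while (completed < total_chunks) {
        // Top up until `depth` reads are in flight or the file is exhausted.
        while (free_count > 0 and next_offset < file_size) {
            free_count -= 1;
            const slot = free_slots[free_count];
            _ = try ring.read(slot, fd, .{ .buffer = buffers[slot] }, next_offset);
            next_offset += chunk_size;
        }
        _ = try ring.submit();

        // Reap one completion; a negative res is an errno-style error.
        const cqe = try ring.copy_cqe();
        if (cqe.res < 0) return error.ReadFailed;
        free_slots[free_count] = @intCast(cqe.user_data);
        free_count += 1;
        completed += 1;
    }
}

The write direction is symmetric: the same loop with ring.write and buffers the caller filled in beforehand.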

benchmark numbers

The result I get from fio is 4.083 GB/s write and 7.33 GB/s read. I'm not including the more detailed statistics fio reports since the benchmark script doesn't collect those.

The result for the diorw example in Zig is 3.802 GB/s write and 6.996 GB/s read.

The Zig code is a bit slower than fio, but it manages to hit the expected numbers for the SSD. Also, fio's run= timing matches the Zig code's timing exactly (to the millisecond), so the gap might come from a difference in how bandwidth is measured.

b) Implementation

Most of the implementation follows concepts found in glommio, a similar library written in Rust.

io_uring usage follows the io_uring document.

1) How to use io_uring for file IO

In my last blog post I benchmarked different io_uring parameters using fio to find which parameters gave a significant performance benefit.

This was crucial for designing the IO library so that it uses the features that matter most for file IO performance.

What I found in that article is that using polled IO instead of interrupt-driven IO, and using the kernel-side submission polling (SQPOLL, sqthread_poll in the fio config) feature of io_uring, are crucial for performance.

Using registered buffers instead of regular buffers also gives a significant performance boost.
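
In io_uring terms, those three findings map onto the setup flags and the buffer-registration API. As a quick reference (using the kernel constants as exposed in Zig's std.os.linux; this is a sketch, not code from the library):

const linux = @import("std").os.linux;

// fio hipri=1         -> IORING_SETUP_IOPOLL (polled completions instead of interrupts)
// fio sqthread_poll=1 -> IORING_SETUP_SQPOLL (kernel-side submission polling thread)
// fio fixedbufs=1     -> register_buffers() plus the read_fixed/write_fixed opcodes
pub const setup_flags: u32 = linux.IORING_SETUP_IOPOLL | linux.IORING_SETUP_SQPOLL;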

2) Using polled IO

We need an io_uring instance with the IOPOLL flag enabled in order to use polled IO. But if we enable IOPOLL, that ring can't do any operation except direct IO reads and writes.

So we need two io_uring instances, one with IOPOLL enabled and one without it.
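
A minimal sketch of that split, assuming a recent Zig std where the wrapper type is std.os.linux.IoUring (entry counts here are arbitrary):

const std = @import("std");
const linux = std.os.linux;

const Rings = struct {
    polled: linux.IoUring, // IOPOLL: only direct IO reads/writes go here
    general: linux.IoUring, // everything else (open, fsync, statx, ...)

    fn init() !Rings {
        return .{
            .polled = try linux.IoUring.init(256, linux.IORING_SETUP_IOPOLL),
            .general = try linux.IoUring.init(256, 0),
        };
    }

    fn deinit(self: *Rings) void {
        self.polled.deinit();
        self.general.deinit();
    }
};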

I don't think glommio or other similar mainstream libraries use this feature, since it requires nvme.poll_queues to be set in the Linux kernel parameters, and I didn't see any errors when I was using glommio even though I didn't have that set in my earlier setups.

I opted to enable this feature because it is important for performance.

3) Using registered buffers for file IO

Since we need to pre-register the buffers to make this effective, the most sensible approach seems to be to allocate the memory we need for this up front and use that memory for all operations. This is the same approach that glommio uses.

This means the user of the library can't pass in memory they allocated themselves for reads/writes, and there is a relatively low limit on the amount of IO memory that can be in use at any given time.

So we implement a buffer interface: a buffer is returned by the read call and the user has to release it back so the library can reuse the memory, and for a write the user has to acquire one of these buffers, fill it, and pass it in.
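
The shape of that interface could look something like the sketch below. These names (IoBuffer, BufferPool, acquire, release) are illustrative, not the library's actual API; the point is that all IO memory comes out of one pre-allocated, pre-registered region and is recycled through the pool:

const std = @import("std");

const IoBuffer = struct {
    data: []u8, // slice into the pre-registered IO memory
    index: u16, // which fixed-size slot this slice comes from
};

const BufferPool = struct {
    const buf_size = 512 * 1024;
    const buf_count = 64;

    memory: []u8, // this region would be registered with register_buffers()
    free: [buf_count]u16 = undefined,
    free_len: usize = buf_count,

    fn init() !BufferPool {
        // page_allocator memory is page aligned, satisfying direct IO.
        var pool = BufferPool{
            .memory = try std.heap.page_allocator.alloc(u8, buf_size * buf_count),
        };
        for (&pool.free, 0..) |*slot, i| slot.* = @intCast(i);
        return pool;
    }

    // Reads hand one of these out; writes take one the caller filled in.
    fn acquire(self: *BufferPool) ?IoBuffer {
        if (self.free_len == 0) return null; // all IO memory is in use
        self.free_len -= 1;
        const index = self.free[self.free_len];
        const start = @as(usize, index) * buf_size;
        return .{ .data = self.memory[start .. start + buf_size], .index = index };
    }

    // Called by the user once they are done, so the memory can be reused.
    fn release(self: *BufferPool, buf: IoBuffer) void {
        self.free[self.free_len] = buf.index;
        self.free_len += 1;
    }
};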

On the read path we can make it easy for the user and handle alignment inside the library by over-allocating to meet the alignment requirements of direct IO. The only caveat the user needs to keep in mind is that there can be read amplification if they issue unaligned reads; the approach is perfectly efficient as long as the reads are aligned.

On the write path, we want to make alignment explicit and require the user to pass a perfectly aligned and sized buffer. It doesn't seem possible to handle alignment inside the library without incurring overhead that is invisible to the user. This is because we have to write a multiple of the alignment, but the user might ask to write less than that; we would then have to read the missing part of the block, combine it with the user's data, and write the result. That read-modify-write can be much worse than just reading some extra data as in the read path.
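
As a concrete illustration of the read-path rounding (4096 is used as an assumed direct IO block size here; the real requirement depends on the device and filesystem):

// Assumed direct IO alignment for the sketch; device and filesystem dependent.
const block_size: u64 = 4096;

// Widen a user read request to alignment boundaries. The user asked for `len`
// bytes at `offset`; the library would actually read the returned range and
// hand back just the requested part. The extra bytes are the read
// amplification mentioned above; for already-aligned requests nothing extra
// is read. For example, a 100-byte read at offset 5000 becomes a 4096-byte
// read at offset 4096.
fn alignedRange(offset: u64, len: u64) struct { offset: u64, len: u64 } {
    const aligned_offset = offset - (offset % block_size);
    const end = offset + len;
    const aligned_end = ((end + block_size - 1) / block_size) * block_size;
    return .{ .offset = aligned_offset, .len = aligned_end - aligned_offset };
}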

I also left out automatic merging of IO inside the library (which glommio does) in order to keep things simple, and because I expect the application will have to implement something more involved on top anyway, where this can be done in a better way. This perspective comes from my previous use of glommio.

4) Using the SQPOLL feature

This feature creates a kernel-side thread that busy-polls the submission queue for requests submitted from user space, and in our case it also makes that thread poll the SSD for IO completions, which ends up making the library code a bit simpler.

This feature doesn't seem to be enabled in other similar libraries, probably because it used to require elevated privileges on older kernels, and also because it runs a busy thread that maxes out a CPU core completely.

Maxing out a CPU core can sound bad, but you can actually share a single busy kernel thread between multiple user-space io_uring instances using the ATTACH_WQ (IORING_SETUP_ATTACH_WQ) flag. It isn't such a big cost if the application is already running 32 threads, especially considering the performance benefit this feature gives.
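
Roughly, sharing one kernel polling thread between two rings looks like this with Zig's std API (a sketch assuming a recent Zig std; entry counts and the idle timeout are arbitrary):

const std = @import("std");
const linux = std.os.linux;

fn initSharedPollRings() !struct { first: linux.IoUring, second: linux.IoUring } {
    // First ring: IOPOLL for polled completions, SQPOLL to start the
    // kernel-side thread that busy-polls the submission queue.
    var params = std.mem.zeroes(linux.io_uring_params);
    params.flags = linux.IORING_SETUP_IOPOLL | linux.IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 1000; // ms of inactivity before the thread sleeps
    const first = try linux.IoUring.init_params(256, &params);

    // Second ring: ATTACH_WQ with wq_fd pointing at the first ring reuses its
    // kernel thread instead of starting another busy thread.
    var attach = std.mem.zeroes(linux.io_uring_params);
    attach.flags = linux.IORING_SETUP_IOPOLL |
        linux.IORING_SETUP_SQPOLL |
        linux.IORING_SETUP_ATTACH_WQ;
    attach.sq_thread_idle = 1000;
    attach.wq_fd = @intCast(first.fd);
    const second = try linux.IoUring.init_params(256, &attach);

    return .{ .first = first, .second = second };
}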