Benchmarking io_uring (direct_io) configurations using fio

Ozgur Akkurt | 2025-08-16

Intro

io_uring is the modern API for doing asynchronous IO on Linux. It is particularly well suited to disk IO.

fio is a tool for benchmarking and testing IO on Linux. It can use io_uring as its IO engine, and it lets us configure how it uses io_uring in depth.

In this post we will go through different configurations, all of which are described in the io_uring document, and see which parameters have the biggest impact on file IO (direct_io) performance when using io_uring.

1) Test System

We are using a machine with:

  • Debian Bookworm (6.1 kernel). A more recent kernel might give significantly different results since io_uring is still improving, but this OS/kernel combination seems like the most common choice for a production server at the time of writing.
  • the kernel parameter nvme.poll_queues=16, so the NVMe driver allocates polling queues (required for polled IO, i.e. hipri, to work).
  • a "datacenter" NVMe SSD without any RAID. This SSD seems to peak at around 7000 MB/s of read throughput when using multiple threads with an optimal config.
  • 756 GB of RAM. The amount of RAM should be irrelevant for this test, since we are using direct_io and the file is therefore not cached.
  • an AMD EPYC CPU with 32 cores (64 threads). The CPU should be mostly irrelevant for this test since we are only using one thread.

2) Config Parameters

fio can be run as fio my_config, where my_config is a config file. Our baseline config file is this:

[global]
direct=1
bs=4K
rw=randread
ioengine=io_uring
iodepth=64
size=16g
hipri=0
fixedbufs=0
registerfiles=0
sqthread_poll=0
numjobs=1

[job0]
filename=/root/testfile

Fixed parameters

  • direct=1 -> We are using direct IO (O_DIRECT), so reads bypass the page cache.
  • bs=4K -> We are reading a 4 KB block from disk per read.
  • rw=randread -> We are only doing random reads from the file, no writes.
  • ioengine=io_uring -> fio uses io_uring to issue the disk IO.
  • iodepth=64 -> fio keeps up to 64 IO operations in flight at a time.
  • size=16g -> The file we use for testing is 16 GB. The size of the file relative to available RAM shouldn't matter, since with direct_io there is no caching.
  • numjobs=1 -> We are only using a single thread/process. (A rough sketch of what this baseline corresponds to in code is shown below.)
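
To make the fixed parameters concrete, here is a minimal, hypothetical liburing sketch of roughly what this baseline workload does: one thread issuing 4 KB O_DIRECT random reads against a single file while keeping 64 reads in flight. It is a simplified illustration, not fio's actual implementation; the file path, loop structure, and error handling are assumptions. It assumes liburing is installed (link with -luring).

// Baseline sketch: 4 KB O_DIRECT random reads with plain io_uring,
// one thread, no optional features enabled.
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { QD = 64, BS = 4096 };            /* iodepth=64, bs=4K */
#define FILE_SIZE (16ULL << 30)         /* size=16g */

int main(void)
{
    struct io_uring ring;

    /* Baseline: no setup flags (hipri=0, sqthread_poll=0). */
    if (io_uring_queue_init(QD, &ring, 0) < 0)
        return 1;

    /* direct=1 -> O_DIRECT, bypassing the page cache. */
    int fd = open("/root/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* O_DIRECT needs aligned buffers; one 4 KB buffer per in-flight read. */
    void *bufs[QD];
    for (int i = 0; i < QD; i++)
        if (posix_memalign(&bufs[i], 4096, BS))
            return 1;

    /* Fill the submission queue with QD random reads. */
    for (int i = 0; i < QD; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        off_t off = (off_t)(random() % (FILE_SIZE / BS)) * BS;
        io_uring_prep_read(sqe, fd, bufs[i], BS, off);
        io_uring_sqe_set_data(sqe, bufs[i]);
    }
    io_uring_submit(&ring);

    /* Reap one completion, resubmit one read, keeping roughly QD in flight. */
    for (;;) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        if (cqe->res != BS) {           /* short read or error */
            fprintf(stderr, "read failed: %d\n", cqe->res);
            break;
        }
        void *buf = io_uring_cqe_get_data(cqe);
        io_uring_cqe_seen(&ring, cqe);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        off_t off = (off_t)(random() % (FILE_SIZE / BS)) * BS;
        io_uring_prep_read(sqe, fd, buf, BS, off);
        io_uring_sqe_set_data(sqe, buf);
        io_uring_submit(&ring);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}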

Tested parameters

  • hipri=? -> Controls the IORING_SETUP_IOPOLL flag of io_uring. Enabling it means the kernel polls the SSD for completions instead of the SSD raising an interrupt for each completed IO.
  • fixedbufs=? -> Controls the buffer registration feature of io_uring. You can read more about this in the io_uring document.
  • registerfiles=? -> Controls the file descriptor registration feature of io_uring. You can read more about this in the io_uring document.
  • sqthread_poll=? -> Controls the SQPOLL (IORING_SETUP_SQPOLL) feature of io_uring. This makes the kernel spawn a busy-polling thread that reads the submission queue, so the user thread doesn't have to make syscalls to tell the kernel there are IO commands in the queue. You can read more about this in the io_uring document. The sketch after this list shows roughly how each of these options maps onto the io_uring API.
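
To show what fio toggles under the hood, here is a hedged liburing sketch of how these four options roughly map onto the io_uring API: hipri and sqthread_poll correspond to ring setup flags, while fixedbufs and registerfiles correspond to registration calls plus the matching SQE variants. The function names and the sq_thread_idle value are illustrative assumptions, not fio's code.

// Rough mapping of the four tested fio options onto the io_uring API.
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

static struct io_uring ring;

/* Set up the ring with all four tested features enabled (the "WITH ALL"
 * configuration). bufs/nbufs are pre-allocated, aligned IO buffers. */
static int setup_ring_with_features(int fd, struct iovec *bufs, unsigned nbufs)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    /* hipri=1 -> IORING_SETUP_IOPOLL: completions are polled instead of
     * driven by device interrupts. Needs O_DIRECT and, on NVMe, polling
     * queues (hence nvme.poll_queues=16 on the test machine). */
    p.flags |= IORING_SETUP_IOPOLL;

    /* sqthread_poll=1 -> IORING_SETUP_SQPOLL: the kernel spawns a thread
     * that busy-polls the submission queue, so submitting usually needs
     * no syscall at all. */
    p.flags |= IORING_SETUP_SQPOLL;
    p.sq_thread_idle = 2000;  /* ms before the SQ thread sleeps (illustrative) */

    if (io_uring_queue_init_params(64, &ring, &p) < 0)
        return -1;

    /* fixedbufs=1 -> register the IO buffers up front so the kernel
     * doesn't have to pin/unpin them for every request. */
    if (io_uring_register_buffers(&ring, bufs, nbufs) < 0)
        return -1;

    /* registerfiles=1 -> register the file descriptor so the kernel can
     * skip the per-request fd lookup and refcounting. */
    if (io_uring_register_files(&ring, &fd, 1) < 0)
        return -1;

    return 0;
}

/* Queue one 4 KB read into registered buffer buf_index at offset off. */
static void queue_read(struct iovec *bufs, int buf_index, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    /* With fixedbufs, reads use the *_fixed variant and pass the index of
     * the registered buffer alongside its address. */
    io_uring_prep_read_fixed(sqe, 0 /* index into registered files */,
                             bufs[buf_index].iov_base, 4096, off, buf_index);

    /* With registerfiles, the fd argument above is an index into the
     * registered file table, and the SQE has to be flagged accordingly. */
    sqe->flags |= IOSQE_FIXED_FILE;
}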

3) Results Table

Configuration          | Bandwidth (avg) | Latency Distribution (usec) | IOPS (avg)
Baseline               | 936.54 MiB/s    | 250=9.40%, 500=90.60%       | 239754.41
WITH hipri             | 1203.73 MiB/s   | 250=99.72%, 500=0.28%       | 308154.52
WITH sqthread_poll     | 1383.20 MiB/s   | (Not reported)              | 354098.26
WITH fixedbufs         | 996.56 MiB/s    | 250=68.35%, 500=31.62%      | 255118.94
WITH registerfiles     | 951.76 MiB/s    | 250=20.77%, 500=79.23%      | 243649.71
WITH ALL               | 1788.14 MiB/s   | (Not reported)              | 457763.11
WITHOUT registerfiles  | 1787.38 MiB/s   | (Not reported)              | 457570.33
WITHOUT fixedbufs      | 1617.83 MiB/s   | (Not reported)              | 414164.40
WITHOUT sqthread_poll  | 1318.05 MiB/s   | 250=99.53%, 500=0.47%       | 337420.33
WITHOUT hipri          | 1508.15 MiB/s   | (Not reported)              | 386087.33

4) Detailed Results

a) Baseline

Using the base config.

config

[global]
direct=1
bs=4K
rw=randread
ioengine=io_uring
iodepth=64
size=16g
hipri=0
fixedbufs=0
registerfiles=0
sqthread_poll=0
numjobs=1

[job0]
filename=/root/testfile

results

bw (  KiB/s): min=915616, max=966640, per=100.00%, avg=959017.18, stdev=10325.42, samples=34
lat (usec)   : 250=9.40%, 500=90.60%, 750=0.01%, 1000=0.01%
iops        : min=228904, max=241660, avg=239754.41, stdev=2581.25, samples=34

b) WITH hipri (IOPOLL)

Same as base config except hipri=1

results

bw (  MiB/s): min= 1152, max= 1210, per=100.00%, avg=1203.73, stdev=10.63, samples=27
lat (usec)   : 250=99.72%, 500=0.28%, 750=0.01%, 1000=0.01%
iops        : min=295052, max=309898, avg=308154.52, stdev=2721.09, samples=27

c) WITH sqthread_poll

Same as base config except sqthread_poll=1

results

bw (  MiB/s): min= 1352, max= 1391, per=100.00%, avg=1383.20, stdev= 8.14, samples=23
iops        : min=346314, max=356152, avg=354098.26, stdev=2082.96, samples=23

There is no latency reported when using sqthread_poll.

d) WITH fixedbufs

Same as base config except fixedbufs=1

results

bw (  KiB/s): min=980920, max=1031152, per=100.00%, avg=1020475.50, stdev=10727.21, samples=32
lat (usec)   : 250=68.35%, 500=31.62%, 750=0.02%, 1000=0.01%
iops        : min=245230, max=257788, avg=255118.94, stdev=2681.88, samples=32

e) WITH registerfiles

Same as base config except registerfiles=1

results

bw (  KiB/s): min=924768, max=979736, per=100.00%, avg=974597.88, stdev=9114.23, samples=34
lat (usec)   : 250=20.77%, 500=79.23%, 750=0.01%, 1000=0.01%
iops        : min=231192, max=244934, avg=243649.71, stdev=2278.60, samples=34

f) WITH ALL

Enabling all features.

config

[global]
direct=1
bs=4K
rw=randread
ioengine=io_uring
iodepth=64
size=16g
hipri=1
fixedbufs=1
registerfiles=1
sqthread_poll=1
numjobs=1

[job0]
filename=/root/testfile

results

bw (  MiB/s): min= 1748, max= 1805, per=100.00%, avg=1788.14, stdev=15.05, samples=18
iops        : min=447624, max=462126, avg=457763.11, stdev=3852.45, samples=18

This unsurprisingly gives the best results.

g) WITHOUT registerfiles

All features enabled except registerfiles.

results

bw (  MiB/s): min= 1770, max= 1794, per=100.00%, avg=1787.38, stdev= 6.11, samples=18
iops        : min=453220, max=459434, avg=457570.33, stdev=1564.26, samples=18

Disabling file registration doesn't seem to have much of an effect; some metrics are even slightly better than with all features enabled.

h) WITHOUT fixedbufs

All features enabled except fixedbufs.

results

bw (  MiB/s): min= 1594, max= 1623, per=100.00%, avg=1617.83, stdev= 6.34, samples=20
iops        : min=408218, max=415512, avg=414164.40, stdev=1622.91, samples=20

i) WITHOUT sqthread_poll

All features enabled except sqthread_poll.

results

bw (  MiB/s): min= 1270, max= 1327, per=100.00%, avg=1318.05, stdev=11.40, samples=24
lat (usec)   : 250=99.53%, 500=0.47%, 750=0.01%, 1000=0.01%
iops        : min=325160, max=339936, avg=337420.33, stdev=2917.80, samples=24

j) WITHOUT hipri

All features enabled except hipri (IO polling mode).

results

bw (  MiB/s): min= 1503, max= 1513, per=100.00%, avg=1508.15, stdev= 3.30, samples=21
iops        : min=384848, max=387366, avg=386087.33, stdev=843.81, samples=21

5) Conclusion

  • Registering files doesn't seem to give a tangible performance improvement, contrary to what the io_uring document suggests. Of course, this could be down to the test system, fio, or something else.
  • Both sqthread_poll and hipri (IOPOLL) seem essential for achieving high IOPS and throughput.
  • Registering IO buffers (fixedbufs) seems to make a decent difference.
  • With the right config, we can achieve very high IOPS and throughput from a single thread using io_uring with direct_io.