
Optimizing ZFS Recordsize: The “Hidden” Variable for Massive Performance Gains
If you’ve been diving into the world of ZFS, you’ve probably spent hours debating RAIDZ levels, L2ARC sizes, and RAM overhead. But there is one setting that often flies under the radar, yet has more impact on your daily IOPS than almost any other: The Recordsize.
Whether you are running a Proxmox cluster, a Plex media server, or a high-performance database, setting the wrong recordsize is like driving a Ferrari in first gear.
Let’s break down how to master this setting and, more importantly, how to verify the results using tools like FIO.
What exactly is ZFS Recordsize?
Think of the recordsize as the maximum size of the data blocks ZFS writes to your pool. By default, ZFS uses 128K.
ZFS is a variable-block-size file system. If you write a 4K file, it will only use 4K on disk. However, if you write a 1MB file, ZFS will chop it into 128K chunks (by default).
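You can see this behavior for yourself. A minimal sketch (the dataset path /tank/demo is a placeholder, and the allocation numbers assume compression is off):

```bash
# Write a 4K file and a 1M file to a dataset using the default 128K recordsize
dd if=/dev/urandom of=/tank/demo/small.bin bs=4k count=1
dd if=/dev/urandom of=/tank/demo/large.bin bs=1M count=1

# Flush pending writes so the allocation numbers are accurate
sync

# du reports allocated space: the small file occupies a single small block,
# while the large file is stored as eight 128K records
du -h /tank/demo/small.bin /tank/demo/large.bin
```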
Why does this matter?
Small recordsize (e.g., 4K or 16K): Great for databases (MySQL, PostgreSQL) because it matches how they talk to the disk. It prevents “Write Amplification.”
Large recordsize (e.g., 1M): Incredible for media files or backups. It increases throughput and improves compression ratios. (Adjusting this per dataset is a one-liner; see below.)
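Recordsize is a per-dataset property, so different workloads on the same pool can each get their own tuning. A minimal sketch (tank/media and tank/db are placeholder dataset names):

```bash
# Check the current recordsize
zfs get recordsize tank/media

# Large records for sequentially read media files
zfs set recordsize=1M tank/media

# Small records for a database dataset
zfs set recordsize=16K tank/db
```

One caveat: changing recordsize only affects newly written data. Existing files keep their old block layout until they are rewritten.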
The Proxmox & VM Trap
If you are running Virtual Machines on ZFS (common in Proxmox or TrueNAS), you might be suffering from a massive performance penalty without knowing it.
Most VM disks on ZFS live on zvols, and zvols use a fixed volblocksize (historically 8K by default on Proxmox; newer releases default to 16K). If that block size doesn't match how your guest actually writes data, every tiny change can trigger a read-modify-write cycle: ZFS reads the whole block, patches the changed bytes, and writes a new copy. This shows up as high latency and "iowait."
The Golden Rule for VMs: Match your ZFS volblocksize to your VM’s internal file system cluster size. For most Linux VMs, 16K or 64K is often the “sweet spot” balance between performance and storage efficiency.
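Unlike recordsize, volblocksize is fixed at creation time and cannot be changed afterwards, so get it right up front. A minimal sketch (tank/vm-101-disk-0 is a placeholder zvol name):

```bash
# Create a 32G zvol with a 16K block size for a VM disk
zfs create -V 32G -o volblocksize=16K tank/vm-101-disk-0

# Verify the property (read-only once the zvol exists)
zfs get volblocksize tank/vm-101-disk-0
```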
How to Test Your Performance (The Scientific Way)
Don’t guess—benchmark. To see if your recordsize is working for you, you need to simulate real-world workloads using FIO (Flexible I/O Tester).
Here is a command to test Random Write performance, which is where a wrong recordsize hurts the most:
```bash
fio --name=random-write --ioengine=libaio --rw=randwrite --bs=4k --size=2G \
    --numjobs=1 --iodepth=64 --runtime=60 --time_based \
    --output-format=json --filename=/your/zfs/path/testfile
```
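For the other side of the coin, where a large recordsize shines, a matching sequential-read throughput test looks like this (same placeholder test path as above):

```bash
fio --name=seq-read --ioengine=libaio --rw=read --bs=1M --size=2G \
    --numjobs=1 --iodepth=8 --runtime=60 --time_based \
    --output-format=json --filename=/your/zfs/path/testfile
```

Keep in mind that ZFS's ARC may serve repeated reads straight from RAM; if you want raw disk numbers, use a test file considerably larger than your ARC.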
Analyzing the Results
Once you run this, you’ll get a complex JSON output. This is where most people get lost.
Pro Tip: Look at your p99 latency, the completion time that 99% of your I/Os stay under. If your recordsize is poorly optimized, you'll see massive spikes here.
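If you redirect the fio output to a file (e.g., append > result.json to the command above), a tool like jq can pull the interesting numbers out for you. A sketch assuming fio 3.x's JSON schema, where completion latencies are reported in nanoseconds:

```bash
# p99 completion latency of the random-write job, in nanoseconds
jq '.jobs[0].write.clat_ns.percentiles["99.000000"]' result.json

# The same value converted to milliseconds for readability
jq '.jobs[0].write.clat_ns.percentiles["99.000000"] / 1000000' result.json
```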
The Verdict: Which size should you choose?
BitTorrent & Large Media: Go for 1M. It reduces metadata overhead and makes your scrubs much faster.
General File Sharing: Stick with the default 128K.
Databases (MySQL/Postgres): Match the page size (16K for MySQL's InnoDB, 8K for PostgreSQL).
Virtualization (Proxmox/KVM): Use 16K or 64K for your zvols.
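On Proxmox specifically, you can set the block size at the storage level so every new VM disk inherits it. A sketch (local-zfs is a placeholder storage ID; the blocksize option applies to ZFS-backed storages, so check the storage documentation for your version):

```bash
# Make new zvols on this storage default to a 16K volblocksize
pvesm set local-zfs --blocksize 16k
```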
Conclusion
ZFS isn’t “set and forget.” It’s a precision instrument. By taking 10 minutes to adjust your recordsize to match your data type, you can often see a 2x to 5x improvement in responsiveness.
What about you? Have you benchmarked your pool lately? Run an FIO test today, and if the numbers look weird, it might be time to tweak that recordsize!
| Workload | Recommended Recordsize | Why? |
|---|---|---|
| Databases | 8K – 16K | Matches the database page size, reduces overhead |
| Virtual Machines | 16K – 64K | Balance between IOPS and space efficiency |
| General Files | 128K | Best default for mixed usage |
| Media / Backups | 1M | Maximum throughput & compression |
Deep Dive: The Cost of Metadata and Ashift
If you want to go even deeper, you must consider the ashift value, which tells ZFS the physical sector size of your disks as a power of two. Most modern drives use 4K native sectors, which corresponds to ashift=12 (2^12 = 4096 bytes).
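You can check what ashift your pool actually uses, and set it explicitly when creating a new one (tank and the device names are placeholders; ashift cannot be changed after a vdev is created):

```bash
# Show the ashift recorded for each vdev in an imported pool
zdb -C tank | grep ashift

# Force 4K sectors explicitly at pool creation time
zpool create -o ashift=12 tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```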
If you set a recordsize that is too small (like 4K) on a RAIDZ array, you can run into massive space inefficiency. Because ZFS stores parity for every record, a small recordsize on a wide RAIDZ pool can leave parity taking up as much space as the data itself, or more. For example, a single 4K record on a RAIDZ2 pool with ashift=12 occupies one 4K data sector plus two 4K parity sectors, so two-thirds of the raw space goes to parity.
Pro Tip for NVMe Users: If you are using high-end NVMe drives, don’t be afraid to experiment with recordsize=16K for your OS drives. While 128K is the default for general storage, the near-zero seek time of NVMe allows ZFS to handle smaller records with almost no performance penalty, significantly reducing write amplification and extending the life of your SSD cells.

