
Optimizing ZFS Recordsize: The “Hidden” Variable for Massive Performance Gains
If you’ve been diving into the world of ZFS, you’ve probably spent hours debating RAIDZ levels, L2ARC sizes, and RAM overhead. But there is one setting that often flies under the radar, yet has more impact on your daily IOPS than almost any other: The Recordsize.
Whether you are running a Proxmox cluster, a Plex media server, or a high-performance database, setting the wrong recordsize is like driving a Ferrari in first gear.
Let’s break down how to master this setting and, more importantly, how to verify the results using tools like FIO.
What exactly is ZFS Recordsize?
Think of the recordsize as the maximum size of the data blocks ZFS writes to your pool. By default, ZFS uses 128K.
ZFS is a variable-block-size file system. If you write a 4K file, it will only use 4K on disk. However, if you write a 1MB file, ZFS will chop it into 128K chunks (by default).
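You can see this behavior for yourself. A minimal sketch (the dataset path /tank/demo is a placeholder, and the allocation numbers assume compression is off):

```bash
# Write a 4K file and a 1M file to a dataset using the default 128K recordsize
dd if=/dev/urandom of=/tank/demo/small.bin bs=4k count=1
dd if=/dev/urandom of=/tank/demo/large.bin bs=1M count=1

# Flush pending writes so the allocation numbers are accurate
sync

# du reports allocated space: the small file occupies a single small block,
# while the large file is stored as eight 128K records
du -h /tank/demo/small.bin /tank/demo/large.bin
```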
Why does this matter?
Small recordsize (e.g., 4K or 16K): Great for databases (MySQL, PostgreSQL) because it matches how they talk to the disk. It prevents “Write Amplification.”
Large recordsize (e.g., 1M): Incredible for media files or backups. It increases throughput and improves compression ratios. (Adjusting this per dataset is a one-liner; see below.)
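Recordsize is a per-dataset property, so different workloads on the same pool can each get their own tuning. A minimal sketch (tank/media and tank/db are placeholder dataset names):

```bash
# Check the current recordsize
zfs get recordsize tank/media

# Large records for sequentially read media files
zfs set recordsize=1M tank/media

# Small records for a database dataset
zfs set recordsize=16K tank/db
```

One caveat: changing recordsize only affects newly written data. Existing files keep their old block layout until they are rewritten.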
The Proxmox & VM Trap
If you are running Virtual Machines on ZFS (common in Proxmox or TrueNAS), you might be suffering from a massive performance penalty without knowing it.
Most VM disks on ZFS live on zvols, and zvols use a fixed volblocksize (historically 8K by default on Proxmox; newer releases default to 16K). If that block size doesn't match how your guest actually writes data, every tiny change can trigger a read-modify-write cycle: ZFS reads the whole block, patches the changed bytes, and writes a new copy. This shows up as high latency and "iowait."
The Golden Rule for VMs: Match your ZFS volblocksize to your VM’s internal file system cluster size. For most Linux VMs, 16K or 64K is often the “sweet spot” balance between performance and storage efficiency.
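Unlike recordsize, volblocksize is fixed at creation time and cannot be changed afterwards, so get it right up front. A minimal sketch (tank/vm-101-disk-0 is a placeholder zvol name):

```bash
# Create a 32G zvol with a 16K block size for a VM disk
zfs create -V 32G -o volblocksize=16K tank/vm-101-disk-0

# Verify the property (read-only once the zvol exists)
zfs get volblocksize tank/vm-101-disk-0
```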
How to Test Your Performance (The Scientific Way)
Don’t guess—benchmark. To see if your recordsize is working for you, you need to simulate real-world workloads using FIO (Flexible I/O Tester).
Here is a command to test Random Write performance, which is where a wrong recordsize hurts the most:
```bash
fio --name=random-write --ioengine=libaio --rw=randwrite --bs=4k --size=2G \
    --numjobs=1 --iodepth=64 --runtime=60 --time_based \
    --output-format=json --filename=/your/zfs/path/testfile
```
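For the other side of the coin, where a large recordsize shines, a matching sequential-read throughput test looks like this (same placeholder test path as above):

```bash
fio --name=seq-read --ioengine=libaio --rw=read --bs=1M --size=2G \
    --numjobs=1 --iodepth=8 --runtime=60 --time_based \
    --output-format=json --filename=/your/zfs/path/testfile
```

Keep in mind that ZFS's ARC may serve repeated reads straight from RAM; if you want raw disk numbers, use a test file considerably larger than your ARC.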
Analyzing the Results
Once you run this, you’ll get a complex JSON output. This is where most people get lost.
Pro Tip: Look at your p99 latency, the completion time that 99% of your I/Os stay under. If your recordsize is poorly optimized, you'll see massive spikes here.
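If you redirect the fio output to a file (e.g., append > result.json to the command above), a tool like jq can pull the interesting numbers out for you. A sketch assuming fio 3.x's JSON schema, where completion latencies are reported in nanoseconds:

```bash
# p99 completion latency of the random-write job, in nanoseconds
jq '.jobs[0].write.clat_ns.percentiles["99.000000"]' result.json

# The same value converted to milliseconds for readability
jq '.jobs[0].write.clat_ns.percentiles["99.000000"] / 1000000' result.json
```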
The Verdict: Which size should you choose?
BitTorrent & Large Media: Go for 1M. It reduces metadata overhead and makes your scrubs much faster.
General File Sharing: Stick with the default 128K.
Databases (MySQL/Postgres): Match the page size (16K for MySQL's InnoDB, 8K for PostgreSQL).
Virtualization (Proxmox/KVM): Use 16K or 64K for your zvols.
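On Proxmox specifically, you can set the block size at the storage level so every new VM disk inherits it. A sketch (local-zfs is a placeholder storage ID; the blocksize option applies to ZFS-backed storages, so check the storage documentation for your version):

```bash
# Make new zvols on this storage default to a 16K volblocksize
pvesm set local-zfs --blocksize 16k
```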
Conclusion
ZFS isn’t “set and forget.” It’s a precision instrument. By taking 10 minutes to adjust your recordsize to match your data type, you can often see a 2x to 5x improvement in responsiveness.
What about you? Have you benchmarked your pool lately? Run an FIO test today, and if the numbers look weird, it might be time to tweak that recordsize!
| Workload | Recommended Recordsize | Why? |
|---|---|---|
| Databases | 8K – 16K | Matches the database page size, reduces overhead |
| Virtual Machines | 16K – 64K | Balance between IOPS and space efficiency |
| General Files | 128K | Best default for mixed usage |
| Media / Backups | 1M | Maximum throughput & compression |
Deep Dive: The Cost of Metadata and Ashift
If you want to go even deeper, you must consider the ashift value, which tells ZFS the physical sector size of your disks as a power of two. Most modern drives use 4K native sectors, which corresponds to ashift=12 (2^12 = 4096 bytes).
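You can check what ashift your pool actually uses, and set it explicitly when creating a new one (tank and the device names are placeholders; ashift cannot be changed after a vdev is created):

```bash
# Show the ashift recorded for each vdev in an imported pool
zdb -C tank | grep ashift

# Force 4K sectors explicitly at pool creation time
zpool create -o ashift=12 tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```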
If you set a recordsize that is too small (like 4K) on a RAIDZ array, you can run into massive space inefficiency. Because ZFS stores parity for every record, a small recordsize on a wide RAIDZ pool can leave parity taking up as much space as the data itself, or more. For example, a single 4K record on a RAIDZ2 pool with ashift=12 occupies one 4K data sector plus two 4K parity sectors, so two-thirds of the raw space goes to parity.
Pro Tip for NVMe Users: If you are using high-end NVMe drives, don’t be afraid to experiment with recordsize=16K for your OS drives. While 128K is the default for general storage, the near-zero seek time of NVMe allows ZFS to handle smaller records with almost no performance penalty, significantly reducing write amplification and extending the life of your SSD cells.

