Introducing muxfs

a mirroring, checksumming, and self-healing filesystem layer for OpenBSD

Date published: 2022-08-08
Author: Stephen D. Adams <stephen@sdadams.org>

The problem: data corruption

I decided it was finally time to build a file server to centralize my files and guard them against bit-rot. Although I would have preferred to use OpenBSD due to its straightforward configuration and sane defaults, I was surprised to find that none of the typical NAS filesystems were supported.

In particular I would need three features in such a filesystem. Firstly, data must be stored redundantly across multiple disks, so that should one disk fail there is at least one other disk from which the data can be recovered. Secondly, all data and metadata must have their checksums stored alongside them, so that if one disk yields corrupted data the valid and invalid copies can be identified and restoration can proceed without the risk of propagating the corruption. Finally, such a filesystem should automatically check and repair data as it is accessed, rather than processing the entire filesystem tree upon every check or repair job.

For this final point it is not that checking the entire tree should be impossible, rather that it should not be necessary to invoke this expensive task frequently, whether by hand or on a cron schedule. The inconvenience and the wasted time and energy aside, regularly processing the entire contents of a disk needlessly shortens its lifespan.

Solutions I considered

Hardware RAID and softraid

There is already plenty of discussion on the internet concerning the viability of RAID as a bit-rot mitigation tool, so I will not go into much detail here, but for those unaware I suggest at least reading up on the term "write hole".

The more expensive RAID cards can address such issues but might simply shift the problem over to, for example, the additional maintenance burden of checking and replacing the on-card batteries.

OpenBSD's softraid(4) is close to a solution to the problem, but we can see the following from the manual page:

  • "The driver relies on underlying hardware to properly fail chunks."
  • "Currently there is no automated mechanism to recover from failed disks."
  • "Certain RAID levels can protect against some data loss due to component failure."

Note that the data protected from loss is qualified as "some". There is also no mention of support for a journal device, which would be needed to address the write hole issue.

If my use case were a webserver serving many images or videos to a large userbase then I would store the content using softraid, since small but rare corruptions to the content would be significantly less damaging than the whole system grinding to a halt. My use case however is to store master copies, particularly of things more sensitive to corruption such as source code.

bitrot, yabitrot, and hashdeep

In ports we can find the scripts bitrot and yabitrot. When run, these compute and store checksums for new files and for files whose timestamps are newer than the time of the previous run. They also point out any files whose timestamp is not newer but whose checksum does not match.
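
To make the role of timestamps concrete, the scanning logic of such tools amounts to something like the following hypothetical sketch (not taken from bitrot or yabitrot): a file is re-hashed only when its modification time is newer than the previous run, and is otherwise verified against its stored checksum.

#include <sys/stat.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical sketch of timestamp-driven checksum scanning; the
 * checksum computation itself is elided. */
static void
scan_file(const char *path, time_t last_run)
{
    struct stat st;

    if (stat(path, &st) == -1) {
        perror(path);
        return;
    }
    if (st.st_mtime > last_run) {
        /* Changed since the last run: compute and store a fresh
         * checksum, implicitly trusting the current content. */
        printf("rehash: %s\n", path);
    } else {
        /* Timestamp unchanged: compare against the stored sum.
         * Silent bit-rot is caught here, but only if the
         * timestamp itself was not also corrupted. */
        printf("verify: %s\n", path);
    }
}

int
main(int argc, char *argv[])
{
    time_t last_run = 0; /* in practice, loaded from a state file */
    int i;

    for (i = 1; i < argc; i++)
        scan_file(argv[i], last_run);
    return 0;
}

Several of the shortcomings listed below fall directly out of these two branches.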

I am accustomed to using a similar program called hashdeep (which I have patched to sum symlinks according to the link target string rather than the contents of the file pointed to).

It should be pointed out that there are some shortcomings of the approach taken by these tools:

  • These tools must be invoked manually or in a cron job. Manual invocation is inconvenient, and such tasks are easily forgotten or dropped entirely. A cron job requires the machine to be booted at the right time, and would need to be coordinated carefully with any users of the system so as not to interrupt or confuse the running job.
  • If a corruption occurs between a file edit and the running of the tool then that corrupted data is marked as valid.
  • If the corrupted data includes the file metadata, particularly the timestamps, then the assumptions made by some of these tools are broken.
  • If a file is deliberately removed from the collection then there is no way for the script to know that this is expected, so it will flag the removed file as missing until the checksum database is manually corrected.
  • These tools merely point out corruptions and require the user to manually restore files from their own backups.

I concede however that the simplicity of these scripts is a boon, and I recommend that the reader consider them for their intended use cases.

Since my use case had not been solved I decided to write my own solution, muxfs, to address this gap in OpenBSD's capabilities.

My solution: muxfs

I needed something that was not only automatic but immediate, intercepting the data in memory while its integrity could still be assured by ECC. It was clear that I needed a filesystem driver. After careful consideration I decided it would be wiser to write this first implementation as a FUSE filesystem. If muxfs proves itself then the FUSE implementation can serve as a stepping-stone towards a more efficient in-kernel filesystem driver.
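
For readers unfamiliar with FUSE, such a driver is essentially a table of callbacks handed to a main loop; the kernel forwards filesystem operations to the userspace process. Below is a minimal hypothetical skeleton against OpenBSD's libfuse in base, not an excerpt from muxfs:

#include <sys/stat.h>
#include <errno.h>
#include <fuse.h>
#include <string.h>

/* Hypothetical skeleton: a filesystem exposing one empty root
 * directory. muxfs implements the full callback set: read, write,
 * readdir, and so on. */
static int
sketch_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    st->st_mode = S_IFDIR | 0755;
    st->st_nlink = 2;
    return 0;
}

static struct fuse_operations ops = {
    .getattr = sketch_getattr,
};

int
main(int argc, char *argv[])
{
    /* fuse_main() parses the mount-point argument and runs the
     * request loop, dispatching kernel calls to the callbacks. */
    return fuse_main(argc, argv, &ops, NULL);
}

Such a program is linked with -lfuse, as can be seen in the muxfs build output later in this article.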

Having read through the UFS and FFS driver source code I was inspired not to reinvent the wheel, but rather to make use of these drivers indirectly. I chose to use directories as storage media instead of writing directly to block devices. This way source code is not duplicated, and muxfs gets the benefit of the many years of development and testing that have gone into whatever filesystem the user places at these directory locations.

It was now obvious that to get data redundancy I could simply fan writes coming into the muxfs driver out to multiple directories. If the user wants to spread the data across multiple disks, that is now something they are in control of. In fact if the user wanted to diversify their storage as a hedge against unforeseen technical faults then they could simultaneously mirror to HDDs and SSDs, each containing a variety of filesystem types.
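
Conceptually the write path is then a loop over the array, applying the same operation to each directory in turn. The following is a hypothetical sketch of that idea, not muxfs's actual code; muxfs additionally updates the relevant checksums at each step:

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical sketch: replicate one write across every directory
 * in the array, in series, so that an interruption leaves at most
 * the directory currently being written in an invalid state.
 * Short-write handling is omitted for brevity. */
static void
mirror_write(const char *dirs[], int ndirs, const char *rel,
    const char *buf, size_t len, off_t off)
{
    char path[1024];
    int i, fd;

    for (i = 0; i < ndirs; i++) {
        snprintf(path, sizeof(path), "%s/%s", dirs[i], rel);
        if ((fd = open(path, O_WRONLY)) == -1)
            err(1, "open %s", path);
        if (pwrite(fd, buf, len, off) == -1)
            err(1, "pwrite %s", path);
        if (fsync(fd) == -1) /* commit before moving on */
            err(1, "fsync %s", path);
        close(fd);
    }
}

int
main(void)
{
    /* Placeholder directory paths for illustration only. */
    const char *dirs[] = { "/mirror/a", "/mirror/b" };

    mirror_write(dirs, 2, "test.txt", "x", 1, 0);
    return 0;
}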

I decided that I wanted both content and metadata to be checksummed, the checksum of the content to be considered part of the checksummed metadata, and checksums to be linked from content to metadata all the way up to the metadata checksum of the root directory. This means that the root directory's metadata checksum is representative of the whole filesystem tree contained inside that root directory.
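
This layout is Merkle-tree-like. As a hypothetical illustration using the MD5 functions from OpenBSD's libc (md5.h): a metadata checksum can fold together a file's attributes, its content checksum, and, for a directory, the metadata checksums of its children, so that one digest at the root covers everything beneath it. The real muxfs record format will differ in its details:

#include <sys/types.h>
#include <md5.h>

/* Hypothetical illustration of checksum chaining. Because each
 * parent digest covers its children's digests, recomputing these
 * bottom-up yields a root sum representative of the whole tree.
 * Timestamps are deliberately not included (see below). */
static void
meta_checksum(mode_t mode, uid_t uid, gid_t gid,
    const unsigned char content_sum[MD5_DIGEST_LENGTH],
    const unsigned char child_sums[][MD5_DIGEST_LENGTH],
    size_t nchildren, unsigned char out[MD5_DIGEST_LENGTH])
{
    MD5_CTX ctx;
    size_t i;

    MD5Init(&ctx);
    MD5Update(&ctx, (const u_int8_t *)&mode, sizeof(mode));
    MD5Update(&ctx, (const u_int8_t *)&uid, sizeof(uid));
    MD5Update(&ctx, (const u_int8_t *)&gid, sizeof(gid));
    MD5Update(&ctx, content_sum, MD5_DIGEST_LENGTH);
    for (i = 0; i < nchildren; i++)
        MD5Update(&ctx, child_sums[i], MD5_DIGEST_LENGTH);
    MD5Final(out, &ctx);
}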

Originally I was going to use SQLite to store the checksums, but I realised I didn't need it. Instead I store the equivalent of one table per file in hidden directories named .muxfs found at the root of each provided directory. All binary data in these "tables" is stored little-endian.
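
Fixing the byte order means a .muxfs directory written on a little-endian machine stays readable on a big-endian one. Here is a minimal sketch of such serialization for a single hypothetical 64-bit field; the actual record layout is muxfs's own:

#include <stdint.h>
#include <stdio.h>

/* Minimal sketch: read and write a 64-bit value little-endian
 * regardless of host byte order. Error handling is omitted for
 * brevity. */
static void
put_le64(FILE *fp, uint64_t v)
{
    int i;

    for (i = 0; i < 8; i++)
        fputc((int)((v >> (8 * i)) & 0xff), fp);
}

static uint64_t
get_le64(FILE *fp)
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++)
        v |= (uint64_t)fgetc(fp) << (8 * i);
    return v;
}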

The design is to check all data read in from the directories before use. This way corruptions can be found and fixed opportunistically. It also prevents partial updates from inappropriately causing corrupted data nearby to be marked as valid.
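
Conceptually the read path then looks like the following outline. The helper functions are hypothetical stand-ins for muxfs internals, so this is a description of the logic rather than working driver code:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for muxfs internals. */
extern bool read_block(int dir, const char *rel, size_t blk, char *buf);
extern bool checksum_ok(int dir, const char *rel, size_t blk,
    const char *buf);
extern void restore_block(int bad_dir, const char *rel, size_t blk,
    const char *buf);

/* Conceptual read path: return the first copy of a block that
 * passes its checksum, and use it to repair any copy that failed
 * the check along the way. */
static bool
checked_read(int ndirs, const char *rel, size_t blk, char *buf)
{
    int i, bad;

    for (i = 0; i < ndirs; i++) {
        if (!read_block(i, rel, blk, buf))
            continue;
        if (!checksum_ok(i, rel, blk, buf))
            continue;
        for (bad = 0; bad < i; bad++)
            restore_block(bad, rel, blk, buf);
        return true;
    }
    return false; /* every copy failed; report the corruption */
}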

A journal was not needed to address the write hole issue. muxfs writes to each directory in series, which guarantees that as long as there is more than one directory in the array there will always be at least one in a valid state. It is then simply a matter of running a muxfs sync to revert or update the interrupted directory to a valid state from one of the others.

I added an optimization for large files, which I define to be files of size larger than one muxfs native block size (currently 4096 bytes). Files are divided into blocks of this size with one content checksum per block, and in the case of large files there is a tree of such checksums joining subsets layer by layer until a single checksum representing the whole file is obtained. This way large files can be edited without muxfs needing to read and sum the whole file upon every write operation to it.
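
The reduction from per-block checksums to a single file checksum can be pictured as follows. This sketch again uses MD5 from libc; its pairwise fan-out is illustrative, not necessarily the grouping muxfs uses:

#include <sys/types.h>
#include <string.h>
#include <md5.h>

/* Hypothetical sketch: collapse a layer of per-block checksums
 * into a single root checksum by hashing adjacent pairs until one
 * digest remains. The array is reduced in place. */
static void
tree_root(unsigned char sums[][MD5_DIGEST_LENGTH], size_t n,
    unsigned char root[MD5_DIGEST_LENGTH])
{
    MD5_CTX ctx;
    size_t i, next;

    if (n == 0)
        return;
    while (n > 1) {
        next = 0;
        for (i = 0; i < n; i += 2) {
            MD5Init(&ctx);
            MD5Update(&ctx, sums[i], MD5_DIGEST_LENGTH);
            if (i + 1 < n)
                MD5Update(&ctx, sums[i + 1], MD5_DIGEST_LENGTH);
            MD5Final(sums[next++], &ctx);
        }
        n = next;
    }
    memcpy(root, sums[0], MD5_DIGEST_LENGTH);
}

With such a tree a write touching one block needs to re-hash only that block and the digests on its path to the root, not the whole file.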

I chose to exclude timestamps from the checksummed metadata since I believe them to be too volatile to be worth tracking. Along the way I dropped support for hardlinking since this would need a means of discovering all paths from the root to a given file, and searching the whole tree upon every write operation would not have been acceptable. Special files and device nodes weren't a good fit for this filesystem either, so I dropped support for those too. Fortunately none of these features would be needed for the use case I had in mind, and if it were necessary to preserve records of such files then they could be archived with tar and the resulting tarball, being a regular file, would be supported. I am also open to adding support for these features if they are requested and if a sane proposal for how to support them is put together.

I wrote muxfs with source code audits in mind, so I have tried to keep the codebase small. The C source files, excluding comments and empty lines, total just over 6k lines, at a maximum of 80 characters per line. The total for all files is a little under 9k lines. I have taken a fail-fast, fail-hard approach to error handling to make bugs more obvious. I have avoided dependencies wherever possible, and at the time of writing muxfs depends only on base. I am also pleased to say that muxfs compiles rather quickly, and the binary is about 100KiB in size.

Tutorial: muxfs storage array

In this example we will take an amd64 OpenBSD 7.1 system which has two unused SATA HDDs and use muxfs to turn the HDDs into a high-integrity storage array. Then we will temporarily attach an external USB HDD and use it to create an offline backup of this filesystem.

I will not cover installation of the OS or making the filesystem available over a network since these aspects should be no different from normal.

CAUTION: muxfs is not yet considered stable. Before you run these commands on your own system read the entire article, most importantly the section "muxfs needs you!".

First, log in as root, create a temporary workspace directory to hold the muxfs source code and build artifacts, and cd into it.

# mktemp -d
/tmp/tmp.Ab3De6Gh9J
# cd /tmp/tmp.Ab3De6Gh9J

Next we fetch the muxfs source code.

# curl -O 'https://sdadams.org/muxfs/muxfs-0.5-current.tgz.sha512'
# curl -O 'https://sdadams.org/muxfs/muxfs-0.5-current.tgz'

Check the sum of the downloaded tarball.

# sha512 -c muxfs-0.5-current.tgz.sha512
(SHA512) muxfs-0.5-current.tgz: OK

If this displays anything other than OK then the tarball is corrupted and should not be used.

Unpack the tarball and cd into it.

# tar -xzf muxfs-0.5-current.tgz
# cd muxfs-0.5-current

Build and install muxfs. Then check to ensure that the muxfs binary and its manual page are installed.

# make
echo  '/* gen.h contents not needed for unity build. */'  >gen.h
cc -std=c99 -pedantic -Wdeprecated -Wall -Wno-unused-function  -Werror -O2 -DNDEBUG=1  -I.  -DMUXFS=static  -DMUXFS_DEC=static  -DMUXFS_DS_MALLOC=0  -Dmuxfs_chk=muxfs_chk_p  -lfuse -lz  -o muxfs  unity.c
# make install
install -o root -g bin -m 0755 muxfs    /usr/local/sbin/muxfs
install -o root -g bin -m 0644 muxfs.1  /usr/local/man/man1/muxfs.1
# whereis muxfs
/usr/local/sbin/muxfs
# muxfs version
muxfs 0.5-current
# makewhatis
# apropos muxfs
muxfs(1) - the Multiplexed File System

Prepare a log file, then verify that it is not readable by others and that it is empty.

# install -o root -g wheel -m 0660 /dev/null /var/log/muxfs
# stat -f '%p %z' /var/log/muxfs
100660 0

Next we will need to edit syslog.conf(5), adding the following lines:

!muxfs
*.*	/var/log/muxfs

Then restart syslogd(8) to pick up the changes.

# rcctl restart syslogd
syslogd(ok)
syslogd(ok)

Now let's examine the disks on the system.

# sysctl hw.diskcount
hw.diskcount=3
# sysctl hw.disknames
hw.disknames=sd0:0123456789abcdef,sd1:,sd2:

This system has three disks: the boot disk and the two spare HDDs. In this case we can see that only sd0 has a disklabel UID (DUID). We can infer that sd0 is the boot disk, and that sd1 and sd2 are the two spare HDDs. Depending on your hardware, your system may use wd instead of sd in the device names. If this is the case then the following commands should still work if you make the corresponding device name substitutions.

Each of the HDDs will need the following layers applied in order:

  1. Master Boot Record (MBR)
  2. OpenBSD Disk Label
  3. Berkeley Fast File System (ffs)

CAUTION: The following commands will IRRECOVERABLY DELETE ALL DATA on the disks you apply them to. Back up anything you do not want to lose to other devices before you proceed.

Write an MBR to each of the disks.

# fdisk -iy sd1
Writing MBR at offset 0.
# fdisk -iy sd2
Writing MBR at offset 0.

Write a disk label to each of the disks.

# disklabel -E sd1
Label editor (enter '?' for help at any prompt)
sd1> a a
offset: [64]
size: *
FS type: [4.2BSD]
sd1*> w
sd1> q
No label changes.
# disklabel -E sd2
Label editor (enter '?' for help at any prompt)
sd2> a a
offset: [64]
size: *
FS type: [4.2BSD]
sd2*> w
sd2> q
No label changes.

Note down the new disklabel UIDs (DUIDs).

# sysctl hw.disknames
hw.disknames=sd0:0123456789abcdef,sd1:1123456789abcdef,sd2:2123456789abcdef

Create an ffs filesystem on each of the new disklabel partitions.

# newfs -t ffs sd1a
# newfs -t ffs sd2a

Create directories to serve as the mount points for the individual ffs filesystems, and ensure that they are only accessible by root.

# install -d -o root -g wheel -m 0555 /var/muxfs
# install -d -o root -g wheel -m 0700 /var/muxfs/a
# install -d -o root -g wheel -m 0700 /var/muxfs/b
# stat -f '%p' /var/muxfs/a
40700
# stat -f '%p' /var/muxfs/b
40700

This access restriction helps to prevent unwanted writes to the root filesystem if an error causes any of the filesystems to become unmounted.

Append the following lines to fstab(5) (replacing the 16-character DUIDs with those you noted from sysctl) to automatically mount the ffs filesystems at boot.

1123456789abcdef.a /var/muxfs/a ffs rw,nodev 0 2
2123456789abcdef.a /var/muxfs/b ffs rw,nodev 0 2

Note here that field 5 is 0, indicating that dump(8) should ignore these filesystems. Backing up the array using dump(8) would be wasteful, and we will cover a more appropriate backup mechanism, muxfs sync, below.

Request that the new filesystems in fstab(5) be mounted now, then check that they are mounted correctly.

# mount -a
# mount | grep 'sd[12]'
/dev/sd1a on /var/muxfs/a type ffs (local, nodev)
/dev/sd2a on /var/muxfs/b type ffs (local, nodev)

A muxfs array can be formatted to use one of the crc32, md5, or sha1 checksum algorithms. muxfs format uses md5 by default, but we will select it explicitly here.

# muxfs format -a md5 /var/muxfs/a /var/muxfs/b

Each of the directories should now contain a hidden directory, .muxfs, which contains data internal to the functioning of muxfs.

Create a directory onto which the muxfs array will be mounted, and check its permissions, as was done for the ffs filesystems.

# install -d -o root -g wheel -m 0700 /mnt/storage
# stat -f '%p' /mnt/storage
40700

Now we can mount the muxfs array.

# muxfs mount /mnt/storage /var/muxfs/a /var/muxfs/b

Let's provoke muxfs to restore a file. First we create a file of test data:

# echo 'Example line of text.' >/mnt/storage/test.txt
# sync

We can now see that the file is present in both of the directories in the array:

# ls /var/muxfs/{a,b}
/var/muxfs/a:
.muxfs  test.txt

/var/muxfs/b:
.muxfs  test.txt
# cat /var/muxfs/{a,b}/test.txt
Example line of text.
Example line of text.

Corrupt the copy of the file in /var/muxfs/a by appending another line:

# echo 'Bad data.' >>/var/muxfs/a/test.txt

Then when we read the file from the muxfs mount-point we will see that muxfs is triggered to restore the file from the copy in /var/muxfs/b:

# cat /mnt/storage/test.txt
Example line of text.
# tail /var/log/muxfs
... Restoring: 0:/test.txt
... Restored: 0:/test.txt

When you're done using the filesystem, unmount it by calling umount(8) on the mount-point.

# umount /mnt/storage

Backups and restoration

It is still important to make offline and off-site backups of your data even when using muxfs. For this we can use the sync command.

In the following commands I will assume that a detachable USB HDD has been formatted in a similar fashion to the individual HDDs in the array, and is mounted at /mnt/backup.

First ensure that your muxfs array is not mounted.

# umount /mnt/storage

Then to create or update another mirror at /mnt/backup use muxfs sync:

# muxfs sync /mnt/backup /var/muxfs/a /var/muxfs/b

This copy can be mounted as part of the array just like the other directories, and this process doubles as the means to replace a failed disk. Keep in mind that muxfs directories do not respond well to being moved across filesystems. muxfs uses inode numbers to match checksums to files, and these numbers cannot be preserved when copying or moving from one filesystem to another. Whenever you need to move data between filesystems remember to use the sync command.
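
The detail underneath is that stat(2) exposes the inode number that muxfs keys its checksum records by, and a copy of a file on another filesystem necessarily receives a new inode. A small illustration; the record lookup described in the comment is hypothetical:

#include <sys/stat.h>
#include <err.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    struct stat st;
    int i;

    for (i = 1; i < argc; i++) {
        if (stat(argv[i], &st) == -1)
            err(1, "stat %s", argv[i]);
        /* A muxfs-style database keys this file's checksum record
         * on st.st_ino; cp(1) to another filesystem produces a
         * different st_ino, orphaning the record. */
        printf("%s: inode %llu\n", argv[i],
            (unsigned long long)st.st_ino);
    }
    return 0;
}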

Suppose a write to /var/muxfs/b is interrupted by a power outage. This should show up as a failure to mount. The state of /var/muxfs/b can be efficiently restored to that of /var/muxfs/a with:

# muxfs sync /var/muxfs/b /var/muxfs/a

If you then want muxfs to read the whole array and report any corrupted files it finds on standard output, use the audit command.

Again first ensure that your array is not mounted:

# umount /mnt/storage

Then issue the audit command:

# muxfs audit /var/muxfs/a /var/muxfs/b
/var/muxfs/a/path/to/corrupted1
/var/muxfs/a/path/to/corrupted2
/var/muxfs/b/path/to/corrupted3
/var/muxfs/b/path/to/corrupted4

Or if you want muxfs to attempt to restore the corrupted files as it finds them, use heal instead:

# muxfs heal /var/muxfs/a /var/muxfs/b
/var/muxfs/a/path/to/corrupted1
/var/muxfs/a/path/to/corrupted2
/var/muxfs/b/path/to/corrupted3
/var/muxfs/b/path/to/corrupted4

The success or failure of restoration attempts can be monitored via the log file.

# tail -f /var/log/muxfs
... Restoring: 0:/path/to/corrupted1
... Restored: 0:/path/to/corrupted1
... Restoring: 0:/path/to/corrupted2
... Restored: 0:/path/to/corrupted2
... Restoring: 1:/path/to/corrupted3
... Restored: 1:/path/to/corrupted3
... Restoring: 1:/path/to/corrupted4
... Restored: 1:/path/to/corrupted4

Here the number prefixed to the path is the index of the directory in the muxfs array.

You might be thinking "Fantastic! So my data is safe from bit-rot now, right?" Yes, well, almost...

muxfs needs you!

No filesystem can be considered stable without thorough testing, and muxfs is no exception.

Even if I had tested muxfs enough to call it stable it still would not be responsible to expect you to simply take my word for it. It is for this reason that I do not intend to release a version 1.0 until there are sufficient citations that I can make to positive, third-party evaluations of muxfs.

This is where you can help.

I need volunteers to test muxfs, provide feedback, and periodically publish test results. The types of things we need to know about include: ease of use, clarity of documentation, needed additional features, bugs, performance, and security issues. Try to be creative in your approach to testing. The more angles we approach it from, the more stable it will become.

I do not recommend testing muxfs on a machine containing sensitive data, or with privileged access to other systems. Instead I recommend running muxfs on a dedicated machine, whether physical or virtual.

muxfs may still have bugs, and expects to run as root. pledge(2) and unveil(2) have not yet been applied since I would prefer to get feedback on usability, and on whether any additional features are needed, before I lock it down. I would prefer the policy to be well-known rather than changing from version to version. That said, I am open to discussion on applying these sooner if requested. In the meantime your own policy can be imposed with a small patch to main() in muxfs.c, as sketched below.
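
As a rough illustration, such a patch might unveil the array directories and then drop privileges with pledge. The promises and permissions below are guesses only and would need tuning to what muxfs and libfuse actually require:

#include <err.h>
#include <unistd.h>

/* Hypothetical hardening to call early in main() in muxfs.c; the
 * promise string and unveil permissions are illustrative guesses.
 * The FUSE device and the mount-point would also need access. */
static void
lockdown(const char *dirs[], int ndirs)
{
    int i;

    for (i = 0; i < ndirs; i++)
        if (unveil(dirs[i], "rwc") == -1)
            err(1, "unveil %s", dirs[i]);
    if (pledge("stdio rpath wpath cpath fattr flock", NULL) == -1)
        err(1, "pledge");
}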

If you insist on testing muxfs with important data then please use muxfs as a mirror downstream of some other well-established storage system, and periodically compare the data against the upstream to find discrepancies.

Feedback can be sent to <muxfs@sdadams.org>. For convenience I have enabled the Discussions feature on the GitHub mirror. I also plan to spend some time in the #openbsd IRC channel on irc.libera.chat; my nick there is sdadams.