Introducing muxfs
a mirroring, checksumming, and self-healing filesystem layer for OpenBSD
Date published: 2022-08-08
Author: Stephen D. Adams <stephen@sdadams.org>
The problem: data corruption
I decided it was finally time to build a file server to centralize my files and guard them against bit-rot. Although I would have preferred to use OpenBSD due to its straightforward configuration and sane defaults, I was surprised to find that none of the typical NAS filesystems were supported.
In particular I would need three features in such a filesystem. Firstly, I would need data to be stored redundantly across multiple disks so that, should one disk fail, there would be at least one other disk from which the data can be recovered. Secondly, I would need all data and metadata to have their checksums stored alongside them, so that if one disk yields corrupted data then the valid and invalid copies would be identifiable and restoration could proceed without the risk of propagating the corruption. Finally, such a filesystem should automatically check and repair data as it is accessed rather than processing the entire filesystem tree upon every check or repair job.
For this final point it is not that checking the entire tree should not be possible, rather that it should be necessary neither to invoke this expensive task manually at frequent intervals, nor to schedule it frequently via cron. The inconvenience and the time and energy wasted aside, regularly processing the entire contents of a disk needlessly shortens its lifespan.
Solutions I considered
Hardware RAID and softraid
There is already plenty of discussion on the internet concerning the viability of RAID as a bit-rot mitigation tool so I will not go into much detail here, but for those unaware I suggest at least reading up on the term "write hole".
The more expensive RAID cards can address such issues but might simply shift the problem over to, for example, the additional maintenance burden of checking and replacing the on-card batteries.
OpenBSD's softraid(4) is close to a solution to the problem, but we can see the following from the manual page:
- "The driver relies on underlying hardware to properly fail chunks."
- "Currently there is no automated mechanism to recover from failed disks."
- "Certain RAID levels can protect against some data loss due to component failure."
Note that the data protected from loss is qualified as "some". There is also no mention of support for a journal device that would be needed to address the write hole issue.
If my use case were a webserver serving many images or videos to a large userbase then I would store the content using softraid, since small-but-rare corruptions to the content would be significantly less bad than the whole system grinding to a halt.
My use case, however, is to store master copies, particularly of things more sensitive to corruption such as source code.
bitrot, yabitrot, and hashdeep
In ports we can find the scripts bitrot and yabitrot.
When run these will compute and store checksums for new files and files whose timestamps are newer than the time of the previous run. They also point out any files whose timestamp is not newer but whose checksum does not match.
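As a rough sketch of the strategy these tools share (hypothetical code, not taken from any of them, with zlib's crc32 standing in for their hashes and the checksum database and directory walk omitted):
/* check.c: sketch of the timestamp-driven checking strategy; compile
 * with: cc check.c -lz */
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <zlib.h>

/* CRC32 of a whole file via zlib, standing in for the tools' hashes. */
static uint32_t
file_checksum(const char *path)
{
	unsigned char buf[4096];
	uint32_t crc = crc32(0L, Z_NULL, 0);
	size_t n;
	FILE *fp;

	if ((fp = fopen(path, "r")) == NULL)
		return 0;
	while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
		crc = crc32(crc, buf, (unsigned int)n);
	fclose(fp);
	return crc;
}

/* A file with an mtime newer than the previous run is re-summed and
 * assumed valid; an unchanged mtime with a mismatching sum is rot. */
static void
check_file(const char *path, time_t last_run, uint32_t stored_sum)
{
	struct stat st;

	if (stat(path, &st) == -1) {
		printf("missing: %s\n", path);
		return;
	}
	if (st.st_mtime > last_run)
		printf("updated: %s -> %08x\n", path, file_checksum(path));
	else if (file_checksum(path) != stored_sum)
		printf("corrupt: %s\n", path);
}

int
main(int argc, char *argv[])
{
	if (argc == 2)
		check_file(argv[1], 0, 0);
	return 0;
}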
I am accustomed to using a similar program called hashdeep (which I have patched to sum symlinks according to the string that is the link, rather than the content of the file pointed to by the link).
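The idea behind that patch can be sketched as follows (a hypothetical illustration, not the actual hashdeep change, again with crc32 standing in for hashdeep's digests): read the link text with readlink(2) and checksum exactly those bytes.
/* linksum.c: checksum a symlink's own link text; compile with:
 * cc linksum.c -lz */
#include <limits.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <zlib.h>

/* Sum exactly the bytes of the link string: no NUL terminator and,
 * crucially, no following of the link. */
static int
symlink_checksum(const char *path, uint32_t *sum_out)
{
	char target[PATH_MAX];
	ssize_t len;

	if ((len = readlink(path, target, sizeof(target))) == -1)
		return -1;
	*sum_out = crc32(crc32(0L, Z_NULL, 0),
	    (const unsigned char *)target, (unsigned int)len);
	return 0;
}

int
main(int argc, char *argv[])
{
	uint32_t s;

	if (argc == 2 && symlink_checksum(argv[1], &s) == 0)
		printf("%08x  %s\n", s, argv[1]);
	return 0;
}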
It should be pointed out that there are some shortcomings of the approach taken by these tools:
- These tools must be invoked manually or in a cron job. Manual invocation is inconvenient, and such tasks are easily forgotten or dropped entirely. A cron job requires the machine to be booted at the right time, and would need to be coordinated carefully with any users of the system so as not to interrupt or confuse the running job.
- If a corruption occurs between a file edit and the running of the tool then that corrupted data is marked as valid.
- If the data corrupted includes the file metadata, particularly the timestamps, then the assumptions made by some of these tools are broken.
- If a file is correctly removed from the collection then there is no way for the script to know that this is expected, and it will flag the removed file as missing until the checksum database is manually corrected.
- These tools merely point out corruptions and require the user to manually restore files from their own backups.
I concede however that the simplicity of the scripts is a boon, and I recommend that the reader consider them for their intended use cases.
Since my use case had not been solved I decided to write my own solution, muxfs, to address this gap in OpenBSD's capabilities.
My solution: muxfs
I needed something that was not only automatic, but immediate, intercepting the data in-memory while its integrity could still be assured by ECC.
It was clear that I needed a filesystem driver.
After careful consideration I decided it would be wiser to write this first implementation as a FUSE filesystem.
If muxfs proves itself then the FUSE implementation can serve as a stepping-stone towards a more efficient in-kernel filesystem driver.
Having read through the UFS and FFS driver source code I was inspired not to reinvent the wheel, but rather to make use of these drivers indirectly.
I chose to use directories as storage media instead of writing directly to block devices.
This way source code is not duplicated, and muxfs gets the benefit of the many years of development and testing that have gone into whatever filesystem the user places at these directory locations.
It was now obvious that to get data redundancy I could simply feed writes coming into the muxfs driver across to multiple directories.
If the user wants to spread the data across multiple disks, that would now be something they are in control of.
In fact if the user wanted to diversify their storage as a hedge against unforeseen technical faults then they could simultaneously mirror to HDDs and SSDs each containing a variety of filesystem types.
I decided that I wanted both content and metadata to be checksummed, the checksum of the content to be considered part of the checksummed metadata, and checksums to be linked from content to metadata all the way up to the metadata checksum of the root directory. This means that the root directory's metadata checksum is representative of the whole filesystem tree contained inside that root directory.
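Conceptually the chaining works like this sketch (the struct and crc32 here are hypothetical stand-ins; muxfs's real metadata and algorithms are richer): a change to any file's content changes its metadata checksum, which changes its parent directory's metadata checksum, and so on up to the root.
/* chain.c: conceptual sketch of checksum chaining up to the root;
 * compile with: cc chain.c -lz */
#include <stdio.h>
#include <stdint.h>
#include <zlib.h>

/* Hypothetical node; real muxfs metadata is richer than this. */
struct node {
	uint32_t	 content_sum;	/* checksum of the file's content   */
	uint32_t	 meta_sum;	/* checksum over metadata and below */
	struct node	*child;		/* first entry (directories only)   */
	struct node	*sibling;	/* next entry under the same parent */
};

/* A file's metadata checksum covers its content checksum; a directory's
 * covers every child's metadata checksum.  The root's meta_sum therefore
 * commits to the entire tree.  (A real format would serialize these sums
 * in a fixed byte order first.) */
static uint32_t
compute_meta_sum(struct node *n)
{
	uint32_t crc = crc32(0L, Z_NULL, 0);
	struct node *c;

	crc = crc32(crc, (const unsigned char *)&n->content_sum,
	    sizeof(n->content_sum));
	for (c = n->child; c != NULL; c = c->sibling) {
		uint32_t s = compute_meta_sum(c);

		crc = crc32(crc, (const unsigned char *)&s, sizeof(s));
	}
	return (n->meta_sum = crc);
}

int
main(void)
{
	struct node file = { 0xdeadbeef, 0, NULL, NULL };
	struct node root = { 0, 0, &file, NULL };

	printf("root meta sum: %08x\n", compute_meta_sum(&root));
	return 0;
}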
Originally I was going to use SQLite to store the checksums, but I realised I didn't need it.
Instead I store the equivalent of one table per file in hidden directories named .muxfs found at the root of each provided directory.
All binary data in these "tables" is stored little-endian.
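For instance, helpers along these lines (an assumed illustration, not muxfs's actual code) read and write a 64-bit value least-significant byte first regardless of the host CPU's byte order:
/* endian.c: read/write a 64-bit value in little-endian byte order */
#include <assert.h>
#include <stdint.h>

/* Write v least-significant byte first, independent of host order. */
static void
put_le64(uint8_t *buf, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		buf[i] = (uint8_t)(v >> (8 * i));
}

/* Read the same representation back. */
static uint64_t
get_le64(const uint8_t *buf)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v |= (uint64_t)buf[i] << (8 * i);
	return v;
}

int
main(void)
{
	uint8_t buf[8];

	put_le64(buf, 0x0123456789abcdefULL);
	assert(buf[0] == 0xef && buf[7] == 0x01);
	assert(get_le64(buf) == 0x0123456789abcdefULL);
	return 0;
}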
The design is to check all data read in from the directories before use. This way corruptions can be found and fixed opportunistically. It also prevents partial updates from inappropriately causing corrupted data nearby to be marked as valid.
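In outline, the read path behaves like this self-contained toy model (hypothetical code; the real driver verifies files on disk against their stored checksums and logs each restoration):
/* heal.c: toy model of check-on-read self-healing across a two-way
 * mirror; compile with: cc heal.c -lz */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>

enum { NDIRS = 2, BLKSZ = 16 };

static unsigned char mirror[NDIRS][BLKSZ];	/* one "block" per directory */
static uint32_t expected_sum;			/* the stored checksum */

static uint32_t
sum(const unsigned char *buf)
{
	return crc32(crc32(0L, Z_NULL, 0), buf, BLKSZ);
}

/* Return the first copy whose checksum verifies, and opportunistically
 * rewrite any copy that does not. */
static unsigned char *
checked_read(void)
{
	int i, j;

	for (i = 0; i < NDIRS; i++) {
		if (sum(mirror[i]) != expected_sum)
			continue;	/* corrupt copy; try the next one */
		for (j = 0; j < NDIRS; j++)
			if (sum(mirror[j]) != expected_sum) {
				memcpy(mirror[j], mirror[i], BLKSZ);
				printf("restored copy %d\n", j);
			}
		return mirror[i];
	}
	return NULL;			/* every copy is corrupt */
}

int
main(void)
{
	memcpy(mirror[0], "Example line....", BLKSZ);
	memcpy(mirror[1], mirror[0], BLKSZ);
	expected_sum = sum(mirror[0]);
	mirror[0][0] = 'X';		/* simulate bit-rot in one copy */
	return checked_read() == NULL;
}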
A journal was not needed to address the write hole issue.
muxfs writes to each directory in series, which guarantees that as long as there is more than one directory in the array there will always be at least one in a valid state.
It is then simply a matter of running muxfs sync to revert or update the interrupted directory to a valid state from one of the others.
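The ordering argument can be expressed as a short sketch (a hypothetical outline, not muxfs's code): each directory is brought fully up to date before the next is touched, so a crash leaves at most one directory mid-update.
/* series.c: outline of the series-write ordering argument */
#include <stdio.h>

static int
update_dir(int dir)
{
	printf("updating directory %d\n", dir);
	return 0;
}

/* Apply one logical write to every mirror directory in series.  Each
 * directory is fully updated before the next is touched, so a crash
 * leaves at most one directory mid-update; every earlier directory
 * already holds the new state, every later one still holds the old
 * state, and muxfs sync can settle the odd one out afterwards. */
static int
series_write(int ndirs, int (*apply)(int))
{
	int i;

	for (i = 0; i < ndirs; i++)
		if (apply(i) == -1)
			return -1;
	return 0;
}

int
main(void)
{
	return series_write(2, update_dir);
}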
I added an optimization for large files, which I define to be files of size larger than one muxfs native block size (currently 4096 bytes).
Files are divided into blocks of this size with one content checksum per block, and in the case of large files there is a tree of such checksums joining subsets layer by layer until a single checksum representing the whole file is obtained.
This way large files can be edited without muxfs needing to read and sum the whole file upon every write operation to it.
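A toy version of such a tree (a hypothetical layout with a pairwise fan-out; muxfs's real on-disk format differs): leaf checksums cover the 4096-byte blocks, each level checksums the level below it, and updating one block only requires recomputing one leaf plus the sums on its path to the root.
/* tree.c: toy checksum tree over fixed-size blocks; compile with:
 * cc tree.c -lz */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>

enum { BLKSZ = 4096, MAXBLKS = 8 };

static uint32_t
sum_bytes(const unsigned char *p, size_t n)
{
	return crc32(crc32(0L, Z_NULL, 0), p, (unsigned int)n);
}

/* Reduce an array of block checksums pairwise until a single root sum
 * remains.  Editing block b means recomputing leaves[b] and then only
 * the sums on its path to the root, not re-summing the whole file. */
static uint32_t
root_sum(uint32_t *sums, size_t n)
{
	size_t i, next;

	while (n > 1) {
		next = 0;
		for (i = 0; i < n; i += 2) {
			uint32_t pair[2];

			pair[0] = sums[i];
			pair[1] = (i + 1 < n) ? sums[i + 1] : 0;
			sums[next++] = sum_bytes(
			    (const unsigned char *)pair, sizeof(pair));
		}
		n = next;
	}
	return sums[0];
}

int
main(void)
{
	static unsigned char file[3 * BLKSZ];	/* a three-block "large" file */
	uint32_t leaves[MAXBLKS];
	size_t i, nblks = sizeof(file) / BLKSZ;

	memset(file, 'a', sizeof(file));
	for (i = 0; i < nblks; i++)
		leaves[i] = sum_bytes(file + i * BLKSZ, BLKSZ);
	printf("root sum: %08x\n", root_sum(leaves, nblks));
	return 0;
}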
I chose to exclude timestamps from the checksummed metadata since I believe them to be too volatile to be worth tracking.
Along the way I dropped support for hardlinking since this would need a means of discovering all paths from the root to a given file, and searching the whole tree upon every write operation would not have been acceptable.
Special files and device nodes weren't a good fit for this filesystem either, so I dropped support for those too.
Fortunately none of these features would be needed for the use case I had in mind, and if it were necessary to preserve records of such files then they could be archived with tar and the resulting tarball, being a regular file, would be supported.
I am also open to adding support for these features if they are requested and if a sane proposal for how to support them is put together.
I wrote muxfs with source code audits in mind, so I have tried to keep the codebase small.
The C source files, excluding comments and empty lines, total just over 6k lines at 80 characters maximum per line.
The total for all files is a little under 9k lines.
I have taken the fail fast and fail hard approach to error handling to make bugs more obvious.
I have avoided dependencies wherever possible, and at the time of writing muxfs only depends on base.
I am also pleased to say that muxfs compiles rather quickly, and the binary is about 100KiB in size.
Tutorial: muxfs storage array
In this example we will take an amd64 OpenBSD 7.1 system which has two unused SATA HDDs and use muxfs to turn the HDDs into a high-integrity storage array.
Then we will temporarily attach an external USB HDD and use it to create an offline backup of this filesystem.
I will not cover installation of the OS or making the filesystem available over a network since these aspects should be no different from normal.
CAUTION: muxfs is not yet considered stable. Before you run these commands on your own system, read the entire article, most importantly the section "muxfs needs you!".
First log in as root, create a workspace directory to hold the muxfs source code and build artifacts, and cd into it.
# mktemp -d
/tmp/tmp.Ab3De6Gh9J
# cd /tmp/tmp.Ab3De6Gh9J
Next we fetch the muxfs source code.
# curl -O 'https://sdadams.org/muxfs/muxfs-0.5-current.tgz.sha512'
# curl -O 'https://sdadams.org/muxfs/muxfs-0.5-current.tgz'
Check the sum of the downloaded tarball.
# sha512 -c muxfs-0.5-current.tgz.sha512
(SHA512) muxfs-0.5-current.tgz: OK
If this displays anything other than OK then the tarball is corrupted and should not be used.
Unpack the tarball and cd into it.
# tar -xzf muxfs-0.5-current.tgz
# cd muxfs-0.5-current
Build and install muxfs.
Then check to ensure that the muxfs binary and its manual page are installed.
# make
echo '/* gen.h contents not needed for unity build. */' >gen.h
cc -std=c99 -pedantic -Wdeprecated -Wall -Wno-unused-function -Werror -O2 -DNDEBUG=1 -I. -DMUXFS=static -DMUXFS_DEC=static -DMUXFS_DS_MALLOC=0 -Dmuxfs_chk=muxfs_chk_p -lfuse -lz -o muxfs unity.c
# make install
install -o root -g bin -m 0755 muxfs /usr/local/sbin/muxfs
install -o root -g bin -m 0644 muxfs.1 /usr/local/man/man1/muxfs.1
# whereis muxfs
/usr/local/sbin/muxfs
# muxfs version
muxfs 0.5-current
# makewhatis
# apropos muxfs
muxfs(1) - the Multiplexed File System
Prepare a log file, then ensure that its mode is not readable by other, and that it is empty.
# install -o root -g wheel -m 0660 /dev/null /var/log/muxfs
# stat -f '%p %z' /var/log/muxfs
10660 0
Next we will need to edit syslog.conf(5), adding the following lines:
!muxfs
*.*	/var/log/muxfs
Then restart syslogd(8) to pick up the changes.
# rcctl restart syslogd
syslogd(ok)
syslogd(ok)
Now let's examine the disks on the system.
# sysctl hw.diskcount
hw.diskcount=3
# sysctl hw.disknames
hw.disknames=sd0:0123456789abcdef,sd1:,sd2:
This system has three disks: the boot disk, and the two spare HDDs.
In this case we can see that only disk sd0 has a disk label.
We can infer that sd0 is the boot disk, and that sd1 and sd2 are the two spare HDDs.
Depending on your hardware your system may use wd instead of sd in the device names.
If this is the case then the following commands should still work if you make the corresponding device name substitutions.
Each of the HDDs will need the following layers applied in order:
- Master Boot Record (MBR)
- OpenBSD Disk Label
- Berkeley Fast File System (ffs)
CAUTION: The following commands will IRRECOVERABLY DELETE ALL DATA on the disks you apply them to. Back up anything you do not want to lose to other devices before you proceed.
Write an MBR to each of the disks.
# fdisk -iy sd1
Writing MBR at offset 0.
# fdisk -iy sd2
Writing MBR at offset 0.
Write a disk label to each of the disks.
# disklabel -E sd1
Label editor (enter '?' for help at any prompt)
sd1> a a
offset: [64]
size: *
FS type: [4.2BSD]
sd1*> w
sd1> q
No label changes.
# disklabel -E sd2
Label editor (enter '?' for help at any prompt)
sd2> a a
offset: [64]
size: *
FS type: [4.2BSD]
sd2*> w
sd2> q
No label changes.
Note down the new disk label IDs.
# sysctl hw.disknames
hw.disknames=sd0:0123456789abcdef,sd1:1123456789abcdef,sd2:2123456789abcdef
Create an ffs filesystem on each of the new disklabel partitions.
# newfs -t ffs sd1a
# newfs -t ffs sd2a
Create directories to serve as the mount points for the individual ffs filesystems, and ensure that they are only accessible by root.
# install -d -o root -g wheel -m 0555 /var/muxfs
# install -d -o root -g wheel -m 0700 /var/muxfs/a
# install -d -o root -g wheel -m 0700 /var/muxfs/b
# stat -f '%p' /var/muxfs/a
40700
# stat -f '%p' /var/muxfs/b
40700
This access restriction helps to prevent unwanted writes to the root filesystem if an error causes any of the filesystems to become unmounted.
Append the following lines to fstab(5) (replacing the 16-character IDs with those you noted from sysctl) to automatically mount the ffs filesystems at boot.
1123456789abcdef.a /var/muxfs/a ffs rw,nodev 0 2
2123456789abcdef.a /var/muxfs/b ffs rw,nodev 0 2
Note here that field 5 is 0, indicating that dump(8) should ignore these filesystems. Backing up the array using dump(8) would be wasteful, and we will cover a more appropriate backup mechanism, muxfs sync, below.
Request that the new filesystems in fstab(5) be mounted now, then check that they are mounted correctly.
# mount -a
# mount | grep 'sd[12]'
/dev/sd1a on /var/muxfs/a type ffs (local, nodev)
/dev/sd2a on /var/muxfs/b type ffs (local, nodev)
A muxfs array can be formatted to use either the crc32, md5, or sha1 checksum algorithm.
muxfs format will use md5 by default, but we will select it explicitly here.
# muxfs format -a md5 /var/muxfs/a /var/muxfs/b
Each of the directories should now contain a hidden directory, .muxfs, which contains data internal to the functioning of muxfs.
Create a directory onto which the muxfs array will be mounted, and check its permissions, as was done for the ffs filesystems.
# install -d -o root -g wheel -m 0700 /mnt/storage
# stat -f '%p' /mnt/storage
40700
Now we can mount the muxfs array.
# muxfs mount /mnt/storage /var/muxfs/a /var/muxfs/b
Let's provoke muxfs to restore a file.
First we create a file of test data:
# echo 'Example line of text.' >/mnt/storage/test.txt
# sync
We can now see that the file is present in both of the directories in the array:
# ls /var/muxfs/{a,b}
/var/muxfs/a:
.muxfs   test.txt

/var/muxfs/b:
.muxfs   test.txt
# cat /var/muxfs/{a,b}/test.txt
Example line of text.
Example line of text.
Corrupt the copy of the file in /var/muxfs/a by appending another line:
# echo 'Bad data.' >>/var/muxfs/a/test.txt
Then when we read the file from the muxfs mount-point we will see that muxfs is triggered to restore the file from the copy in /var/muxfs/b:
# cat /mnt/storage/test.txt
Example line of text.
# tail /var/log/muxfs
... Restoring: 0:/test.txt
... Restored: 0:/test.txt
When you're done using the filesystem unmount it by calling umount(8) on the mount-point.
# umount /mnt/storage
Backups and restoration
It is still important to make offline and off-site backups of your data even when using muxfs.
For this we can use the sync command.
In the following commands I will assume that a detachable USB HDD has been formatted in a similar fashion to the individual HDDs in the array, and is mounted at /mnt/backup.
First ensure that your muxfs array is not mounted.
# umount /mnt/storage
Then to create or update another mirror at /mnt/backup use muxfs sync:
# muxfs sync /mnt/backup /var/muxfs/a /var/muxfs/b
This copy can be mounted as part of the array just like the other directories, and this process doubles as the means to replace a failed disk.
Keep in mind that muxfs directories do not respond well to being moved across filesystems.
muxfs uses inode numbers to match checksums to files, and these numbers cannot be preserved when copying or moving from one filesystem to another.
Whenever you need to move data between filesystems remember to use the sync command.
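To see why, consider this simplified sketch of keying a checksum record by inode number (a simplification of how muxfs matches checksums to files): a copy made with cp(1) or rsync(1) necessarily receives a fresh inode on the destination filesystem, so the old key no longer matches.
/* inokey.c: checksum records keyed by inode number (simplified) */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* A copy made on another filesystem is allocated a fresh inode, so a
 * record keyed like this no longer matches; muxfs sync rebuilds the
 * records instead. */
int
main(int argc, char *argv[])
{
	struct stat st;

	if (argc == 2 && stat(argv[1], &st) == 0)
		printf("checksum record key: inode %llu\n",
		    (unsigned long long)st.st_ino);
	return 0;
}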
Suppose a write to /var/muxfs/b is interrupted by a power outage.
This should show up as a failure to mount.
The state of /var/muxfs/b can be efficiently restored to that of /var/muxfs/a with:
# muxfs sync /var/muxfs/b /var/muxfs/a
If you then want to have muxfs read the whole array and report any corrupted files it finds on the standard output then you can use audit.
Again first ensure that your array is not mounted:
# umount /mnt/storage
Then issue the audit command:
# muxfs audit /var/muxfs/a /var/muxfs/b
/var/muxfs/a/path/to/corrupted1
/var/muxfs/a/path/to/corrupted2
/var/muxfs/b/path/to/corrupted3
/var/muxfs/b/path/to/corrupted4
Or if you want muxfs to attempt to restore the corrupted files it finds as it goes then use heal instead:
# muxfs heal /var/muxfs/a /var/muxfs/b
/var/muxfs/a/path/to/corrupted1
/var/muxfs/a/path/to/corrupted2
/var/muxfs/b/path/to/corrupted3
/var/muxfs/b/path/to/corrupted4
The success or failure of restoration attempts can be monitored via the log file.
# tail -f /var/log/muxfs
... Restoring: 0:/path/to/corrupted1
... Restored: 0:/path/to/corrupted1
... Restoring: 0:/path/to/corrupted2
... Restored: 0:/path/to/corrupted2
... Restoring: 1:/path/to/corrupted3
... Restored: 1:/path/to/corrupted3
... Restoring: 1:/path/to/corrupted4
... Restored: 1:/path/to/corrupted4
Here the number prefixed to the path is the index of the directory in the muxfs array.
You might be thinking "Fantastic! So my data is safe from bit-rot now, right?" Yes, well, almost...
muxfs needs you!
No filesystem can be considered stable without thorough testing and muxfs is no exception.
Even if I had tested muxfs enough to call it stable it still would not be responsible to expect you to simply take my word for it.
It is for this reason that I do not intend to release a version 1.0 until there are sufficient citations that I can make to positive, third-party evaluations of muxfs.
This is where you can help.
I need volunteers to test muxfs, provide feedback, and periodically publish test results.
The types of things we need to know about include: ease of use, clarity of documentation, needed additional features, bugs, performance, and security issues.
Try to be creative in your approach to testing.
The more angles we approach it from, the more stable it will become.
I do not recommend testing muxfs on a machine containing sensitive data, or with privileged access to other systems.
Instead I recommend running muxfs on a dedicated machine, whether physical or virtual.
muxfs may still have bugs, and expects to run as root.
pledge(2) and unveil(2) have not yet been applied since I would prefer to get feedback on usability, and whether any additional features are needed, before I lock it down.
I would prefer for the policy to be well-known rather than changing from version to version.
This said, I am open to discussion on applying these sooner if requested.
In the meantime your own policy can be imposed with a small patch to main() in muxfs.c.
If you insist on testing muxfs with important data then please use muxfs as a mirror downstream of some other well-established storage system, and periodically compare the data against the upstream to find discrepancies.
Feedback can be sent to <muxfs@sdadams.org>. For convenience I have enabled the Discussions feature on the GitHub mirror. I also plan to spend some time in the #openbsd IRC channel on irc.libera.chat; my nick there is sdadams.