Ideas for a Log-Structured/Copy-on-Write File System

These notes on the design of an LFS just record my thoughts for the time when I find a student who wants to implement a file system. However, if you find any of the ideas useful, feel free to implement them; maybe we can then write a paper together.

Note that some people do not classify this kind of file system as log-structured (see Free Space Management); e.g., Valerie Henson classifies such file systems as copy-on-write file systems. However, throughout this page the term "log-structured" is used.

Benefits

What are today's file systems missing? Are they not good enough? What does the proposed file system offer that existing file systems do not have?

Data consistency

Most file systems today only guarantee meta-data consistency in case of a crash or power outage. That means that, after a crash, you will find the directories mostly preserved, but the data in the files may be lost completely (and yes, this has happened to me). Fortunately, this does not happen to files that have not been written to or changed for a while before the crash, and crashes are rare enough that many people don't worry much about this.

But if an application requires that files or groups of files are internally consistent, this cavalier attitude to data consistency is not satisfactory. Some people say that the application should use fsync() to ensure consistency, but even if the application programmers try to follow this advice, they find that it is hard to test, because it can only be fully tested by pulling the plug on the machine. And if a feature is hard to test and rarely used, it will often have bugs.

The consistency guarantee we wish for is in-order semantics: the state of the file system after recovery represents all write()s (or other changes) that occurred before a specific point in time, and no write() (or other change) that occurred afterwards. I.e., at most you lose a minute or so of work (and you can eliminate that with fsync()/sync() when needed). This means that if an application ensures file data consistency in case of its own unexpected termination by performing writes and other file changes in the right order, its file data will also be consistent (but possibly not up-to-date) in case of a system failure. Consistency through unexpected termination is much easier to test than consistency through a system crash, so it is much more likely that programmers will get it right.
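To illustrate what applications gain from this guarantee, here is a minimal sketch (in Python; the update-by-rename pattern is standard, the function name is made up) of an application keeping a file consistent purely through write ordering; with in-order semantics the same ordering also protects against a system crash:

import os, tempfile

def update_atomically(path, new_data):
    # Write the new contents (bytes) to a temporary file in the same
    # directory, then rename it over the old file.  Under in-order semantics,
    # after a crash the file contains either the complete old or the complete
    # new data, even without any fsync() (at the price of possibly losing the
    # most recent update).
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    os.write(fd, new_data)
    os.close(fd)
    os.rename(tmp, path)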

Most file systems today do not give the in-order semantics guarantee. In general they give hardly any guarantee about data consistency, only about metadata consistency. This includes most journaling file systems: most of them have offered only meta-data journaling from the start, and in ext3, which used to offer data journaling, this feature is broken in many Linux versions. Those that do offer data journaling suffer a substantial performance penalty, because all data has to be written twice.

Reiser4 uses LFS techniques (called Wandering Logs in Reiser4) and other techniques to achieve very good consistency guarantees (it allows full transactions via file system plugins), although it's not completely clear if it gives the in-order semantics consistency guarantee. However, Reiser4 does not offer snapshots or clones (yet:-). It is also a much more complex file system, and thus probably less amenable to studying variations in the design decisions.

Snapshots

A snapshot of a file system reflects the state that the file system was in at a specific point in time.

Creating a snapshot of a file system is very useful for making consistent backups of the file system without having to block access to it. Snapshots also allow access to files that have been deleted or changed in the mainline.

LVMs offer snapshot functionality somewhat independently of the file system (some interaction is required to ensure that the snapshot is of a consistent file system). The disadvantage of an LVM is that it is less efficient in:

time
When a block is written, the LVM copies the original elsewhere to keep it for the snapshot; in contrast, an LFS just writes the new block into an unused block. The LVM approach probably results in a slowdown by at least a factor of 2-3 on large writes, and maybe more. Consequently, running with an LVM snapshot is not something that you would want to do all the time (e.g., to give the users some safety net against their own mistakes).
space
The LVM does not know which blocks of the file system are free, so it has to use additional space for keeping the snapshot data, even if there is lots of space left in the file system. Another way to look at it is that we can give that space to the original file system, which can make productive use of it when it is not needed for snapshots.

Clones

A clone is a writable snapshot. It is useful for experiments, e.g., testing the installation of new software without risking the existing setup, while keeping that setup usable.

What about performance?

Originally the main selling point of log-structured file systems was performance, in particular write performance (caching was supposed to make read performance a non-issue).

However, the write performance of file systems like ext2 is so good that it is probably hard to beat, and then not by much; moreover, for most uses write performance is not a big issue. The main contributions to write performance over BSD FFS have been completely asynchronous operations and improved block allocation policies.

Therefore, with respect to performance, the goal of the present project is just to stay competitive with file systems like ext2 and ext3 on typical workloads (and maybe to be exceptionally good or bad on artificial workloads). As a consequence, some of the design decisions in the present project are quite different from those set out in early LFS work.

Ideas

To understand the following, you will probably find it helpful to read the first three sections of LinLogFS - A Log-Structured Filesystem For Linux first.

Free Space management

The original LFS proposals and implementations dropped traditional free-space management (based on maintaining a free-blocks set) in favour of segment-based allocation combined with a cleaner (a kind of copying garbage collector) for reclaiming segments. The rationale was that this would lead to better performance because it would eliminate most seek overheads during writing.

I propose going back to the free-blocks set, for the following reasons:

You may wonder why one would still call the resulting file system a log-structured file system. That's because it still has the property of performing no in-place updates: all writes go to free blocks (and overwritten blocks are not freed before the new blocks are committed).

Clones and reference counting

In the presence of snapshots and clones, a block may belong to several clones, so freeing it in one clone does not necessarily free it overall. If garbage collection (cleaning) is not the solution, the usual fall-back in programming languages is reference counting; that approach is already well-established in file systems (e.g., inode link counts) and seems to be the right solution here. Reference counting has been used successfully in a copy-on-write style system.

Reference counting incurs the following costs: Clone creation is cheap. Writing to a block with a reference count >1 may be more expensive than writing a plain block, by a constant factor; the worst case is writing to a fully-used indirect block, where the reference counts of all pointed-to blocks have to be updated (e.g., 1024 for a 4KB indirect block with 4-byte block numbers). Clone deletion costs time proportional to the number of blocks unique to the clone. These costs seem acceptable.
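A minimal sketch (in Python; refcounts, pointers, and allocate_block are invented names, not part of the design) of the worst case just mentioned, the copy-on-write of a shared indirect block:

def cow_indirect_block(refcounts, old_nr, pointers, allocate_block):
    # The old indirect block stays referenced by the other clone(s); the new
    # copy references the same data blocks, so each of their reference counts
    # goes up by one (up to 1024 updates for a 4KB indirect block with
    # 4-byte block numbers).
    new_nr = allocate_block()
    refcounts[new_nr] = 1
    refcounts[old_nr] -= 1
    for child_nr in pointers:
        refcounts[child_nr] += 1
    return new_nr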

Storing commit points

Once we eliminate segment-based allocation, how do we find commit points? One way would be to update the superblock(s) every time we write a commit point. This is slow, especially for loads with small synchronous writes.

An alternative is to allocate several extents (block ranges) distributed on the disk for commit points and record these extents in the superblock(s). Then we can use these extents for many commit points before having to update the superblocks again.

By using several extents distributed across the disk, we can write a commit block close to the data and meta-data of the files described in the commit block, limiting the seek time during writing.

One advantage of this approach is that we can read many commit points with a few seeks, so roll-forward start points are needed only relatively rarely; however, taking full advantage of this requires organizing the on-disk information (especially the change records) and the in-memory data structures in a way that does not require many seeks on mounting.

To ensure that we recognize only commit points as commit points, the file system has to clear the extent before using it for commit points. Time stamps in the commit points are used to determine their order.
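Here is a sketch (in Python; the commit-block layout, the magic number, and the read_block callback are assumptions made only for illustration) of how mounting could find the commit points and put them in order:

COMMIT_MAGIC = b"COMMITPT"   # invented 8-byte magic number

def find_commit_points(read_block, extents):
    # extents: the (start, length) block ranges recorded in the superblock.
    # Because an extent is cleared before use, any block in it that starts
    # with the magic number is a commit point; the time stamps determine
    # the order in which the commit points were written.
    commits = []
    for start, length in extents:
        for block_nr in range(start, start + length):
            block = read_block(block_nr)
            if block[:8] == COMMIT_MAGIC:
                timestamp = int.from_bytes(block[8:16], "little")
                commits.append((timestamp, block_nr))
    commits.sort()
    return commits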

So, overall the order of operations is:

clear an extent of blocks, allocate them for commit points
commit
update the superblocks
start writing commit points to the extent

Of course, at some point the FS has to free an extent (to allow adding a new one to the superblock). This can be achieved either by producing a roll-forward start block (this frees several commit-point extents) or by consolidating the commit information from that extent into the other extents in some other way (this is probably not worth doing).
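The same order of operations as code (a sketch only; dev and superblock stand in for whatever interfaces the implementation ends up with, and BLOCK_SIZE is assumed to be 4KB):

BLOCK_SIZE = 4096

def add_commit_extent(dev, superblock, start, length):
    # 1. clear the extent so that stale data cannot be mistaken for commit points
    for block_nr in range(start, start + length):
        dev.write_block(block_nr, b"\0" * BLOCK_SIZE)
    # 2. commit, so the cleared blocks are on disk before they are advertised
    dev.commit()
    # 3. record the new extent in the superblocks
    superblock.commit_extents.append((start, length))
    dev.write_superblocks(superblock)
    # 4. from now on, commit points may be written into this extent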

One variation on this idea would be to allocate and clear extents for commit blocks on file system creation, and never change the location of these extents, making writes to the superblock and later clearing of the extent unnecessary. In this variant only freeing of individual commit blocks (instead of whole extents) would be necessary. Problem: How to deal with bad blocks (if at all)?

Defragmentation

Even with free-block space management, an LFS can lead to fragmentation in some cases where a conventional file system does not fragment, in particular when partially overwriting files. OTOH, it is also quite easy to defragment in an LFS. One way to deal with fragmentation is to transparently create a new copy of the file.

This is especially cheap when the file system has just read the file, as the contents are still in the cache; this is also the time when we notice without additional effort that the file is fragmented.

As a special case of this kind of defragmentation, an overwritten block of an existing file may be written back to the place where the original block resided, if that block is now free. This is somewhat similar to how journaling file systems work.
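A sketch of the transparent-copy defragmentation described above (in Python; block_list(), allocate_contiguous(), rewrite_to(), and the fragmentation heuristic are all invented for illustration):

def is_fragmented(block_list, max_runs=4):
    # crude heuristic: count the contiguous runs the file's blocks form
    runs = 1 + sum(1 for a, b in zip(block_list, block_list[1:]) if b != a + 1)
    return runs > max_runs

def maybe_defragment(file):
    # Called after the file has been read completely, i.e., while its contents
    # are still in the cache; rewriting it then costs no additional reads.
    blocks = file.block_list()
    if is_fragmented(blocks):
        new_blocks = file.fs.allocate_contiguous(len(blocks))
        if new_blocks is not None:
            file.rewrite_to(new_blocks)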

Snapshots and clones present a special problem in that context: they will be used for making backups, and most fragmented files will probably be noticed during backup, so the file system wants to defragment in the snapshot/clone being backed up. But doing so there would inhibit defragmentation in the user-side snapshot/clone. How do we make the defragmentation affect all heads?

Set-based Free-space management

For a long time I had a blind spot for reference counting and was thinking of having a separate set of free blocks for each clone; I guess that the WAFL paper (which is otherwise a very good introduction, but does it that way) led me astray, probably helped along by a dislike of reference counting from my programming-language background.

So the following are some ideas for how to make such a set-per-clone approach workable in a more general setting than WAFL. They might be useful if you need to know which clones a block belongs to, how many blocks a clone would occupy by itself, etc. But unless you know that you need this, it is probably not worth reading.

Free-space management and snapshots

This subsection should be rewritten to use the concepts of the next one, for better presentation.

At each snapshot we have a set of free blocks (and complementary, a set of allocated blocks) for the files reachable through this snapshot. The free block set for the whole file system is the intersection of the free block sets of all the snapshots (including the head of the file system).

How do we get this global free-blocks set when we need it (in particular, for allocating blocks for a file)? We maintain it all the time and update it when something changes. The changes of interest are allocating a block, freeing a block, creating a snapshot, and destroying a snapshot.

Allocating a block works as usual, and creating a snapshot has no immediate effect on the free-block set; when freeing a block, we free it in the free list of the head (and I'll explain a little later how this affects the global free list). The most interesting operation is destroying a snapshot.

As long as we have only read-only snapshots, we can view the file system as a linear sequence of snapshots. Between two snapshots, a set of blocks are allocated (born), and a set of blocks are freed (killed). The FS maintains the set of blocks that were born and killed since the last snapshot. This leads to the following maintenance actions:

allocate block
free &= ~{block}
head.born |= {block}
free block
if (block in head.born)
  head.born &= ~{block}   /* born and freed since the last snapshot; no snapshot sees it */
  free |= {block}
else
  head.killed |= {block}  /* still visible in older snapshots */
create snapshot
snapshot.born = head.born
snapshot.killed = head.killed
head.born = {}
head.killed = {}
destroy snapshot
free |= snapshot.born & snapshot.next.killed  /* born on this edge, dead on the next: no remaining state needs these blocks */
snapshot.next.born |= snapshot.born &~ snapshot.next.killed
snapshot.next.killed &= ~snapshot.born
snapshot.next.killed |= snapshot.killed

Picture: Snapshots and free-block sets

So, overall, maintaining the free list across all snapshots is relatively simple: allocating and freeing blocks in the head costs not much more than in the simple case, creating a snapshot is cheap, and the cost of destroying a snapshot depends on how much allocation/freeing happened between the adjacent snapshots, which should be acceptable.
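For concreteness, here is a directly runnable version of the maintenance actions above (Python sets; the Edge class, the nxt link, and the omission of the bookkeeping that links snapshots into a chain are simplifications of this sketch, not part of the design):

class Edge:
    # a snapshot or the head; born/killed are the blocks born/killed on the
    # edge leading to it; nxt points to the next newer edge (finally the head)
    def __init__(self, nxt=None):
        self.born, self.killed, self.nxt = set(), set(), nxt

free = set()   # global free-block set

def allocate_block(head, block):
    free.discard(block)
    head.born.add(block)

def free_block(head, block):
    if block in head.born:
        head.born.discard(block)
        free.add(block)
    else:
        head.killed.add(block)

def create_snapshot(head):
    snap = Edge(nxt=head)
    snap.born, head.born = head.born, set()
    snap.killed, head.killed = head.killed, set()
    return snap   # linking snap behind the older snapshots is omitted here

def destroy_snapshot(snap):
    nxt = snap.nxt
    free.update(snap.born & nxt.killed)   # never visible in any remaining state
    nxt.born |= snap.born - nxt.killed
    nxt.killed -= snap.born
    nxt.killed |= snap.killed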

Some invariants for all snapshots (including head):

It may be possible to be lazy with some of the operations, and recreate the proper state of the various sets (actually we only need the free-block set) through these invariants when needed.

Free-space management and clones

Clones complicate the issue, because the linear sequence of snapshots with one writable head now turns into a tree with several writable heads (the leaves of the tree). So instead of snapshots and a head we now think about clone points (where the history of the file system forks into two clones), several heads, and edges from the start or a clone point to another clone point or a head (it's probably better to use edges as a separate concept in the linear case, too).

Picture: Clones, clone points, and heads

The invariants above change; in particular, the killed sets of alternative edges can overlap (the same block can be freed in several alternative clones).

The relevant file system operations now are: allocating a block for a head, freeing a block from a head, cloning a head, and deleting a head together with its clone point.

The associated maintenance actions are:
allocate block for head
free &= ~{block}
head.born |= {block}
free block from head
if (block in head.born)
  head.born &= ~{block}
  free |= {block}
else
  head.killed |= {block}
  if (block in all head.killed)  /* the block is now dead in every clone */
    free |= {block}
    all head.killed &= ~{block} /* necessary? correct? */
clone head (producing head1, head2, and clonepoint)
clonepoint.born = head.born
clonepoint.killed = head.killed
head1.born = {}
head1.killed = {}
head2.born = {}
head2.killed = {}
delete head and its clonepoint, thus fusing the clonepoint and brother edges into brother
free |= head.born | (clonepoint.born & brother.killed)  /* blocks visible only in the deleted branch */
brother.born |= clonepoint.born &~ brother.killed
brother.killed &= ~clonepoint.born
brother.killed |= clonepoint.killed
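The two clone-specific actions in runnable form, reusing the sets from the sketch in the previous subsection (again only a sketch; all_heads is assumed to enumerate every current head):

# free, Edge: the global free-block set and the edge objects from the
# previous sketch

def free_block_from_head(head, all_heads, block):
    if block in head.born:
        head.born.discard(block)
        free.add(block)
        return
    head.killed.add(block)
    if all(block in h.killed for h in all_heads):   # dead in every clone
        free.add(block)
        for h in all_heads:
            h.killed.discard(block)   # the step the text marks "necessary? correct?"

def delete_head(head, clonepoint, brother):
    # fuse the clone point's edge and the brother edge into "brother"
    free.update(head.born)                          # only the deleted clone ever saw these
    free.update(clonepoint.born & brother.killed)   # born before the fork, dead in the surviving branch
    brother.born |= clonepoint.born - brother.killed
    brother.killed -= clonepoint.born
    brother.killed |= clonepoint.killed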

Set representation

We have to represent the various sets (in particular the free-blocks set) permanently on disk, in the commit blocks, and in memory.

The operations to perform are as outlined above; in addition, there has to be a way to quickly find a mostly-contiguous set of blocks (this is probably already solved well in current file systems that are based on free-block sets).

The representations of sets with few elements should be relatively small, and set operations involving few elements on one or both sides should be relatively cheap. Also, reading partial/incremental sets from commit points during mounting should not require many seeks to complete the mount (either by having a data structure that does not require much seeking, or by performing the seeking lazily after mounting).

Probably some tree-based set implementation is a good idea for the on-disk data structure, with incremental change records for commit points; for most in-memory sets (which will probably be relatively small compared to the number of total blocks), such a tree will probably be a good idea, too.
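As a toy illustration of a compact set representation and of incremental change records (in Python; a real implementation would rather use the tree-based on-disk structure suggested above):

def to_extents(blocks):
    # represent a set of block numbers as a sorted list of (start, length) extents
    extents = []
    for b in sorted(blocks):
        if extents and extents[-1][0] + extents[-1][1] == b:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((b, 1))
    return extents

def change_record(old_set, new_set):
    # incremental change record to be stored with a commit point: only the
    # blocks that entered or left the set since the last commit point
    return {"added": to_extents(new_set - old_set),
            "removed": to_extents(old_set - new_set)}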

Links

LinLogFS - A Log-Structured Filesystem For Linux | What's wrong with synchronous metadata updates?
Anton Ertl