Ideas for a Log-Structured/Copy-on-Write File System

These notes on the design of an LFS just record my thoughts for the time when I find a student who wants to implement a file system. However, if you find any of the ideas useful, feel free to implement them; maybe we can then write a paper together.

Note that some people do not classify this kind of file system as log-structured (see Free Space Management); e.g., Valerie Henson classifies such file systems as copy-on-write file systems. However, throughout this page the term "log-structured" is used.

Benefits

What are today's file systems missing? Are they not good enough? What does the proposed file system offer that existing file systems do not have?

Data consistency

Most file systems today only guarantee meta-data consistency in case of a crash or power outage. That means that, after a crash, you will find the directories mostly preserved, but the data in the files may be lost completely (and yes, this has happened to me). Fortunately, this does not happen to files that have not been written to or changed for a while before the crash, and crashes are rare enough that many people don't worry much about this.

But if an application requires that files or groups of files are internally consistent, this cavalier attitude to data consistency is not satisfactory. Some people say that the application should use fsync() to ensure consistency, but even if the application programmers try to follow this advice, they find that it is hard to test, because it can only be fully tested by pulling the plug on the machine. And if a feature is hard to test and rarely used, it will often have bugs.

The consistency guarantee we wish for is in-order semantics: the state of the file system after recovery represents all write()s (or other changes) that occurred before a specific point in time, and no write() (or other change) that occurred afterwards. I.e., at most you lose a minute or so of work (and you can eliminate that with fsync()/sync() when needed). This means that if an application ensures file data consistency in case of its own unexpected termination by performing writes and other file changes in the right order, its file data will also be consistent (but possibly not up-to-date) in case of a system failure. Consistency through unexpected termination is much easier to test than consistency through a system crash, so it is much more likely that programmers will get it right.
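To illustrate what applications gain from this guarantee, here is a minimal sketch (in Python; the update-by-rename pattern is standard, the function name is made up) of an application keeping a file consistent purely through write ordering; with in-order semantics the same ordering also protects against a system crash:

import os, tempfile

def update_atomically(path, new_data):
    # Write the new contents (bytes) to a temporary file in the same
    # directory, then rename it over the old file.  Under in-order semantics,
    # after a crash the file contains either the complete old or the complete
    # new data, even without any fsync() (at the price of possibly losing the
    # most recent update).
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    os.write(fd, new_data)
    os.close(fd)
    os.rename(tmp, path)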

Most file systems today do not give the in-order semantics guarantee. In general they give hardly any guarantee about data consistency, only about metadata consistency. This includes most journaling file systems: most of them have offered only meta-data journaling from the start, and in ext3, which used to offer data journaling, this feature is broken in many Linux versions. Those that do offer data journaling suffer a substantial performance penalty, because all data has to be written twice.

Reiser4 uses LFS techniques (called Wandering Logs in Reiser4) and other techniques to achieve very good consistency guarantees (it allows full transactions via file system plugins), although it's not completely clear if it gives the in-order semantics consistency guarantee. However, Reiser4 does not offer snapshots or clones (yet:-). It is also a much more complex file system, and thus probably less amenable to studying variations in the design decisions.

Snapshots

A snapshot of a file system reflects the state that the file system was in at a specific point in time.

Creating a snapshot of a file system is very useful for making consistent backups of the file system without having to block access to it. Snapshots also allow access to files that have been deleted or changed in the mainline.

LVMs offer snapshot functionality somewhat independently of the file system (some interaction is required to ensure that the snapshot is of a consistent file system). The disadvantage of an LVM is that it is less efficient in:

time
When a block is written, the LVM copies the original elsewhere to keep it for the snapshot; in contrast, an LFS just writes the new block into an unused block. The LVM approach probably results in a slowdown by at least a factor of 2-3 on large writes, and maybe more. Consequently, running with an LVM snapshot is not something that you would want to do all the time (e.g., to give the users some safety net against their own mistakes).
space
The LVM does not know which blocks of the file system are free, so it has to use additional space for keeping the snapshot data, even if there is lots of space left in the file system. Another way to look at it is that we can give that space to the original file system, which can make productive use of it when it is not needed for snapshots.

Clones

A clone is a writable snapshot. It is useful for experiments, e.g., testing the installation of new software without risking the existing setup, while keeping that setup usable.

What about performance?

Originally the main selling point of log-structured file systems was performance, in particular write performance (caching was supposed to make read performance a non-issue).

However, the write performance of file systems like ext2 is so good that it is probably hard to beat, and then not by much; moreover, for most uses write performance is not a big issue. The main contributions to write performance over BSD FFS have been completely asynchronous operations and improved block allocation policies.

Therefore, with respect to performance, the goal of the present project is just to stay competitive with file systems like ext2 and ext3 on typical workloads (and maybe to be exceptionally good or bad on artificial workloads). As a consequence, some of the design decisions in the present project are quite different from those set out in early LFS work.

Ideas

To understand the following, you will probably find it helpful to read the first three sections of LinLogFS - A Log-Structured Filesystem For Linux first.

Free Space management

The original LFS proposals and implementations dropped traditional free-space management (based on maintaining a free-blocks set) in favour of segment-based allocation combined with a cleaner (a kind of copying garbage collector) for reclaiming segments. The rationale was that this would lead to better performance because it would eliminate most seek overheads during writing.

I propose going back to the free-blocks set, for the following reasons:

You may wonder why one would still call the resulting file system a log-structured file system. That's because it still has the property of performing no in-place updates: all writes go to free blocks (and overwritten blocks are not freed before the new blocks are committed).

Clones and reference counting

In the presence of snapshots and clones, a block may belong to several clones, so freeing it in one clone does not necessarily free it overall. If garbage collection (cleaning) is not the solution, the usual fall-back in programming languages is reference counting; that approach is already well-established in file systems (e.g., inode link counts) and seems to be the right solution here. Reference counting has been used successfully in a copy-on-write style system.

Reference counting incurs the following costs: Clone creation is cheap. Writing to a block with a reference count >1 may be more expensive than writing a plain block, by a constant factor; the worst case is writing to a fully-used indirect block, where the reference counts of all pointed-to blocks have to be updated (e.g., 1024 for a 4KB indirect block with 4-byte block numbers). Clone deletion costs time proportional to the number of blocks unique to the clone. These costs seem acceptable.
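A minimal sketch (in Python; refcounts, pointers, and allocate_block are invented names, not part of the design) of the worst case just mentioned, the copy-on-write of a shared indirect block:

def cow_indirect_block(refcounts, old_nr, pointers, allocate_block):
    # The old indirect block stays referenced by the other clone(s); the new
    # copy references the same data blocks, so each of their reference counts
    # goes up by one (up to 1024 updates for a 4KB indirect block with
    # 4-byte block numbers).
    new_nr = allocate_block()
    refcounts[new_nr] = 1
    refcounts[old_nr] -= 1
    for child_nr in pointers:
        refcounts[child_nr] += 1
    return new_nr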

Storing commit points

Once we eliminate segment-based allocation, how do we find commit points? One way would be to update the superblock(s) every time we write a commit point. This is slow, especially for loads with small synchronous writes.

An alternative is to allocate several extents (block ranges) distributed on the disk for commit points and record these extents in the superblock(s). Then we can use these extents for many commit points before having to update the superblocks again.

By using several extents distributed across the disk, we can write a commit block close to the data and meta-data of the files described in the commit block, limiting the seek time during writing.

One advantage of this approach is that we can read many commit points with a few seeks, so roll-forward start points are needed only relatively rarely; however, taking full advantage of this requires organizing the on-disk information (especially the change records) and the in-memory data structures in a way that does not require many seeks on mounting.

To ensure that we recognize only commit points as commit points, the file system has to clear the extent before using it for commit points. Time stamps in the commit points are used to determine their order.
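Here is a sketch (in Python; the commit-block layout, the magic number, and the read_block callback are assumptions made only for illustration) of how mounting could find the commit points and put them in order:

COMMIT_MAGIC = b"COMMITPT"   # invented 8-byte magic number

def find_commit_points(read_block, extents):
    # extents: the (start, length) block ranges recorded in the superblock.
    # Because an extent is cleared before use, any block in it that starts
    # with the magic number is a commit point; the time stamps determine
    # the order in which the commit points were written.
    commits = []
    for start, length in extents:
        for block_nr in range(start, start + length):
            block = read_block(block_nr)
            if block[:8] == COMMIT_MAGIC:
                timestamp = int.from_bytes(block[8:16], "little")
                commits.append((timestamp, block_nr))
    commits.sort()
    return commits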

So, overall the order of operations is:

clear an extent of blocks, allocate them for commit points
commit
update the superblocks
start writing commit points to the extent

Of course, at some point the FS has to free an extent (to allow adding a new one to the superblock). This can be achieved either by producing a roll-forward start block (this frees several commit-point extents) or by consolidating the commit information from that extent into the other extents in some other way (this is probably not worth doing).
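The same order of operations as code (a sketch only; dev and superblock stand in for whatever interfaces the implementation ends up with, and BLOCK_SIZE is assumed to be 4KB):

BLOCK_SIZE = 4096

def add_commit_extent(dev, superblock, start, length):
    # 1. clear the extent so that stale data cannot be mistaken for commit points
    for block_nr in range(start, start + length):
        dev.write_block(block_nr, b"\0" * BLOCK_SIZE)
    # 2. commit, so the cleared blocks are on disk before they are advertised
    dev.commit()
    # 3. record the new extent in the superblocks
    superblock.commit_extents.append((start, length))
    dev.write_superblocks(superblock)
    # 4. from now on, commit points may be written into this extent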

One variation on this idea would be to allocate and clear extents for commit blocks on file system creation, and never change the location of these extents, making writes to the superblock and later clearing of the extent unnecessary. In this variant only freeing of individual commit blocks (instead of whole extents) would be necessary. Problem: How to deal with bad blocks (if at all)?

Defragmentation

Even with free-block space management, an LFS can lead to fragmentation in some cases where a conventional file system does not fragment, in particular when partially overwriting files. OTOH, it is also quite easy to defragment in an LFS. One way to deal with fragmentation is to transparently create a new copy of the file.

This is especially cheap when the file system has just read the file, as the contents are still in the cache; this is also the time when we notice without additional effort that the file is fragmented.

As a special case of this kind of defragmentation, an overwritten block of an existing file may be written back to the place where the original block resided, if that block is now free. This is somewhat similar to how journaling file systems work.
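A sketch of the transparent-copy defragmentation described above (in Python; block_list(), allocate_contiguous(), rewrite_to(), and the fragmentation heuristic are all invented for illustration):

def is_fragmented(block_list, max_runs=4):
    # crude heuristic: count the contiguous runs the file's blocks form
    runs = 1 + sum(1 for a, b in zip(block_list, block_list[1:]) if b != a + 1)
    return runs > max_runs

def maybe_defragment(file):
    # Called after the file has been read completely, i.e., while its contents
    # are still in the cache; rewriting it then costs no additional reads.
    blocks = file.block_list()
    if is_fragmented(blocks):
        new_blocks = file.fs.allocate_contiguous(len(blocks))
        if new_blocks is not None:
            file.rewrite_to(new_blocks)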

Snapshots and clones present a special problem in that context: they will be used for making backups, and most fragmented files will probably be noticed during backup, so the file system wants to defragment in the snapshot/clone being backed up. But doing so there would inhibit defragmentation in the user-side snapshot/clone. How do we make the defragmentation affect all heads?

Set-based Free-space management

For a long time I had a blind spot for reference counting and was thinking of having a separate set of free blocks for each clone; I guess that the WAFL paper (which is otherwise a very good introduction, but does it that way) led me astray, probably helped along by a dislike of reference counting from my programming-language background.

So the following are some ideas for how to make such a set-per-clone approach workable in a more general setting than WAFL. They might be useful if you need to know which clones a block belongs to, how many blocks a clone would occupy by itself, etc. But unless you know that you need this, it is probably not worth reading.

Free-space management and snapshots

This subsection should be rewritten to use the concepts of the next one, for better presentation.

At each snapshot we have a set of free blocks (and complementary, a set of allocated blocks) for the files reachable through this snapshot. The free block set for the whole file system is the intersection of the free block sets of all the snapshots (including the head of the file system).

How do we get this global free-blocks set when we need it (in particular, for allocating blocks for a file)? We maintain it all the time and update it when something changes. The changes of interest are allocating a block, freeing a block, creating a snapshot, and destroying a snapshot.

Allocating a block works as usual, and creating a snapshot has no immediate effect on the free-block set; when freeing a block, we free it in the free list of the head (and I'll explain a little later how this affects the global free list). The most interesting operation is destroying a snapshot.

As long as we have only read-only snapshots, we can view the file system as a linear sequence of snapshots. Between two snapshots, a set of blocks are allocated (born), and a set of blocks are freed (killed). The FS maintains the set of blocks that were born and killed since the last snapshot. This leads to the following maintenance actions:

allocate block
free &= ~{block}
head.born |= {block}
free block
if (block in head.born)
  head.born &= ~{block}   /* born and freed since the last snapshot; no snapshot sees it */
  free |= {block}
else
  head.killed |= {block}  /* still visible in older snapshots */
create snapshot
snapshot.born = head.born
snapshot.killed = head.killed
head.born = {}
head.killed = {}
destroy snapshot
free |= snapshot.born & snapshot.next.killed  /* born on this edge, dead on the next: no remaining state needs these blocks */
snapshot.next.born |= snapshot.born &~ snapshot.next.killed
snapshot.next.killed &= ~snapshot.born
snapshot.next.killed |= snapshot.killed

Picture: Snapshots and free-block sets

So, overall, maintaining the free list across all snapshots is relatively simple: allocating and freeing blocks in the head costs not much more than in the simple case, creating a snapshot is cheap, and the cost of destroying a snapshot depends on how much allocation/freeing happened between the adjacent snapshots, which should be acceptable.
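For concreteness, here is a directly runnable version of the maintenance actions above (Python sets; the Edge class, the nxt link, and the omission of the bookkeeping that links snapshots into a chain are simplifications of this sketch, not part of the design):

class Edge:
    # a snapshot or the head; born/killed are the blocks born/killed on the
    # edge leading to it; nxt points to the next newer edge (finally the head)
    def __init__(self, nxt=None):
        self.born, self.killed, self.nxt = set(), set(), nxt

free = set()   # global free-block set

def allocate_block(head, block):
    free.discard(block)
    head.born.add(block)

def free_block(head, block):
    if block in head.born:
        head.born.discard(block)
        free.add(block)
    else:
        head.killed.add(block)

def create_snapshot(head):
    snap = Edge(nxt=head)
    snap.born, head.born = head.born, set()
    snap.killed, head.killed = head.killed, set()
    return snap   # linking snap behind the older snapshots is omitted here

def destroy_snapshot(snap):
    nxt = snap.nxt
    free.update(snap.born & nxt.killed)   # never visible in any remaining state
    nxt.born |= snap.born - nxt.killed
    nxt.killed -= snap.born
    nxt.killed |= snap.killed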

Some invariants for all snapshots (including head):

It may be possible to be lazy with some of the operations, and recreate the proper state of the various sets (actually we only need the free-block set) through these invariants when needed.

Free-space management and clones

Clones complicate the issue, because the linear sequence of snapshots with one writable head now turns into a tree with several writable heads (the leaves of the tree). So instead of snapshots and a head we now think about clone points (where the history of the file system forks into two clones), several heads, and edges from the start or a clone point to another clone point or a head (it's probably better to use edges as a separate concept in the linear case, too).

Picture: Clones, clone points, and heads

The invariants above change; in particular, the killed sets of alternative edges can overlap (the same block can be freed in several alternative clones).

The relevant file system operations now are: allocating a block for a head, freeing a block from a head, cloning a head, and deleting a head together with its clone point.

The associated maintenance actions are:
allocate block for head
free &= ~{block}
head.born |= {block}
free block from head
if (block in head.born)
  head.born &= ~{block}
  free |= {block}
else
  head.killed |= {block}
  if (block in all head.killed)  /* the block is now dead in every clone */
    free |= {block}
    all head.killed &= ~{block} /* necessary? correct? */
clone head (producing head1, head2, and clonepoint)
clonepoint.born = head.born
clonepoint.killed = head.killed
head1.born = {}
head1.killed = {}
head2.born = {}
head2.killed = {}
delete head and its clonepoint, thus fusing the clonepoint and brother edges into brother
free |= head.born | (clonepoint.born & brother.killed)  /* blocks visible only in the deleted branch */
brother.born |= clonepoint.born &~ brother.killed
brother.killed &= ~clonepoint.born
brother.killed |= clonepoint.killed
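The two clone-specific actions in runnable form, reusing the sets from the sketch in the previous subsection (again only a sketch; all_heads is assumed to enumerate every current head):

# free, Edge: the global free-block set and the edge objects from the
# previous sketch

def free_block_from_head(head, all_heads, block):
    if block in head.born:
        head.born.discard(block)
        free.add(block)
        return
    head.killed.add(block)
    if all(block in h.killed for h in all_heads):   # dead in every clone
        free.add(block)
        for h in all_heads:
            h.killed.discard(block)   # the step the text marks "necessary? correct?"

def delete_head(head, clonepoint, brother):
    # fuse the clone point's edge and the brother edge into "brother"
    free.update(head.born)                          # only the deleted clone ever saw these
    free.update(clonepoint.born & brother.killed)   # born before the fork, dead in the surviving branch
    brother.born |= clonepoint.born - brother.killed
    brother.killed -= clonepoint.born
    brother.killed |= clonepoint.killed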

Set representation

We have to represent the various sets (in particular the free-blocks set) permanently on disk, in the commit blocks, and in memory.

The operations to perform are as outlined above; in addition, there has to be a way to quickly find a mostly-contiguous set of blocks (this is probably already solved well in current file systems that are based on free-block sets).

The representations of sets with few elements should be relatively small, and set operations involving few elements on one or both sides should be relatively cheap. Also, reading partial/incremental sets from commit points during mounting should not require many seeks to complete the mount (either by having a data structure that does not require much seeking, or by performing the seeking lazily after mounting).

Probably some tree-based set implementation is a good idea for the on-disk data structure, with incremental change records for commit points; for most in-memory sets (which will probably be relatively small compared to the number of total blocks), such a tree will probably be a good idea, too.
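As a toy illustration of a compact set representation and of incremental change records (in Python; a real implementation would rather use the tree-based on-disk structure suggested above):

def to_extents(blocks):
    # represent a set of block numbers as a sorted list of (start, length) extents
    extents = []
    for b in sorted(blocks):
        if extents and extents[-1][0] + extents[-1][1] == b:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((b, 1))
    return extents

def change_record(old_set, new_set):
    # incremental change record to be stored with a commit point: only the
    # blocks that entered or left the set since the last commit point
    return {"added": to_extents(new_set - old_set),
            "removed": to_extents(old_set - new_set)}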

Links

LinLogFS - A Log-Structured Filesystem For Linux | What's wrong with synchronous metadata updates?
Anton Ertl