Hole-punching compaction of an append-only databasePosted: December 19, 2011
The Baardskeerder project, as explained before, is an implementation of a B-tree-ish data structure, using an append-only file to persist the data to a storage device (e.g. a hard drive or SSD device). Not overwriting data has several advantages (consistency guarantees, efficiency,…), yet it also incurs a major disadvantage: over time, an increasing amount of data is stored and won’t ever be reclaimed, even if the data itself isn’t of any use any longer (e.g. a value node could be no longer referenced because the corresponding key was deleted or a new value was stored, internal nodes describing the tree itself become invalid because the tree structure changes when new key/value-pairs are added,…).
Even though storage capacity becomes cheaper, there’s no such thing as an infinite disk, and not freeing disk space, even though the data it contains has no actual value any longer, is useless.
One way to reduce the overhead is to rewrite the database, so it only contains the data contained in the tree at the moment the action is initiated. We call this a “full compaction”, rewriting the database to a second file. This has a major advantage: after compaction, the new database file is the smallest you can get (it doesn’t contain any overhead data), and as such data is stored very dense.
Their are some drawbacks as well: the operation can’t be performed online without taking specific measures, since the resulting database will only represent a snapshot of the original one. There might be a storage capacity issue as well: because the data is copied to another file, we need the disk space available to be able to perform these copies. In the worst case, if no data can be discarded at all, we’ll need 100% of the original database size available as free space, i.e. when compacting to the same disk, the size of the original database should be less than 50% of the total disk capacity. Do note this really is a worst-case scenario.
Next to this, when the resulting database of the compaction process should be stored on the hard disk which also contains the original database, performance will suffer: lots of random IO will be needed to read the original database, and because of this the IO operations to write the output file, which should be sequential, append-only during normal operation, will become random as well (because it’s interleaved with random reads).
Overall, even though the “full compaction” mechanism yields the best result in the end, it’s not something you want to run too frequently. On the other hand, we do want to return the overhead storage to the system so it can be reused for more useful purposes. This operation should be cheap, so it can be executed more frequently, and the number of IO operations required to perform the operation should be as low as possible.
First, a little intermezzo (you can skip this if you know what a ‘sparse’ file is). Most “modern” filesystems, especially (but not only) in the UNIX world, have support for ‘sparse’ files. A sparse file is a file which contains empty regions, blocks which appear to be filled with zeros, but aren’t really mapped to blocks on the block device to which the filesystem writes its data. These empty regions can be added to the file when it’s being written (you can extend a file with such empty space), and data can be written to these empty regions later on (at this point in time, the blocks are mapped to physical blocks, of course).
These ‘empty’ regions are taken into account when determining the size of the file, but they don’t occupy any space on the disk device. When reading from these regions, all you get are ‘zero’ bytes.
Here’s a simple demonstration, creating a large (2GB) sparse file containing no data at all, then displaying its apparent size as well as the actual size on disk:
[nicolas@tau ~]$ dd if=/dev/null of=demo.dat seek=$((2 * 1024 * 1024)) bs=1k 0+0 records in 0+0 records out 0 bytes (0 B) copied, 2.6617e-05 s, 0.0 kB/s [nicolas@tau ~]$ ls -lh demo.dat -rw-rw-r--. 1 nicolas nicolas 2.0G Dec 16 16:10 demo.dat [nicolas@tau ~]$ du --block-size=1 demo.dat 0 demo.dat
This shows the size of the generated file appears to be 2.0GB, but it’s stored to disk using exactly 0 bytes (next to the inode/metadata size, of course).
During normal operation, one can create a sparse region in a file by using the ftruncate(2) system call. This only allows to append an empty region to the end of a file, not marking an existing region as being ‘empty’.
Luckily, (very) recent Linux kernel versions allow this in a couple of file systems: support for the FALLOC_FL_PUNCH_HOLE option for the fallocate(2) call is available in at lease ext4, XFS and OCFS2. Using this option, an existing region in a file can be marked as ‘empty’, even though it contained data before. The file system will change its mappings so the region becomes sparse, and zero out any ‘overflow’ bytes at the left or right end side of the region (since only blocks can be unmapped).
Thanks to this feature, we can implement cheap resource deallocation by punching holes in a database file for every region which is no longer of any use. Blocks will become available again, and the file system will handle defragmentation (if required), whilst all offsets encoded in the internal nodes will remain valid.
Now we know the mechanism to free unused resources, we need to find out which resources can be freed, i.e. calculate where we can punch holes in the database file. Initially, we thought about walking the tree, keeping track of the offset and size of all referenced entries as a list of integer pairs, then merging adjecent pairs, and finally punching holes at all areas of the file which aren’t part of this final map.
This approach has a major drawback: a lot of state is being accumulated throughout the traversal, which results in lots of memory usage: the number of entries required to represent a large database can be huge!
Thanks to a very basic observation, a more simple, efficient and elegant algorithm is available though! Here’s the deal: entries are added to the database file using append-only writes. Any entry can contain pointers to other entries, encoded as the offsets of these entries in the file. An entry can, at all times, only point to other entries that were stored before itself. As such, an entry can only point to entries on its ‘left’, at lower offsets, an entry can never reference another entry on its ‘right’.
This simplifies things a lot, and allows us to implement the following recursive algorithm: the state passed at every function call consists of only 2 values: a set of offsets (integers) S, and a single offset (integer) O.
In every iteration, the largest integer in set S is determined, and removed from the set. Let’s call this value C. This is the offset of the entry we’ll handle during this iteration. We read the entry from disk, and parse it. If it’s an entry which references other entries (i.e. a Leaf or Index entry), we add the offsets of every entry it references to set S. Do note, at all times, these offsets will be smaller than C (entries can only refer to other entries found earlier in the file!).
Next, we calculate the size L of the entry we’re dealing with, and calculate the offset of the end of the entry in the file, E = C + L.
At this point, we can safely punch a hole from E to O (read on to know what O represents).
Now we check whether S is empty. If it is, there are no more referenced entries below offset C. As such, we can punch a hole from the beginning of the file (or, in the Baardskeerder on-disk format, after the metadata blocks) up to C.
Otherwise, if S isn’t empty, a recursive call is made, passing the updated set and using C as value for O.
Finally, we need to initiate the operation. Here’s how: first, find the commit entry of the oldest version of the database to be retained (remember, Baardskeerder is a functional/persistent data structure!). Any older versions (with commit entries at lower offsets) will become invalid because of the compaction process. Now, the recursive process can be initiated using the singleton set containing the offset of the root entry to which the commit entry points as S, and the offset of the commit entry as value of O.
Note set S will, unlike the list in the initial approach, never be very big: it will, at all times, only contain the offset of all known entries at the left of the one we’re dealing with, it’ll never contain the offsets of all entries in the database.
Here’s a Haskell implementation of the algorithm, which might clarify some things:
import Prelude hiding (null) import Data.ByteString (ByteString) import Data.Int (Int32) import Data.IntSet (IntSet) import qualified Data.IntSet as IS import System.Posix (COff, Fd) type Key = ByteString type Value = ByteString type Offset = COff type Length = Int32 data Entry = Commit Offset Length Offset | Leaf Offset Length Value | Node Offset Length [(Key, Offset)] getEntry :: Fd -> Offset -> IO Entry getEntry f o = undefined punch :: Fd -> Offset -> Length -> IO () punch f o l = undefined data CompactState = CS Offset IntSet fileStart :: Offset fileStart = 4096 compact :: Fd -> Offset -> IO () compact f o = do c <- getEntry f o case c of Commit _ _ r -> compact' f $ CS o (IS.singleton $ fromIntegral r) otherwise -> error "compact: not a commit entry" compact' :: Fd -> CompactState -> IO () compact' f (CS n os) = do -- At this stage, s should never be empty let (h, os') = IS.deleteFindMax os h' = fromIntegral h e <- getEntry f h' let (e', os'') = case e of Leaf o l _ -> (o + fromIntegral l, os') Node o l rs -> (o + fromIntegral l, foldr (IS.insert . fromIntegral) os' $ map snd rs) otherwise -> error "compact': invalid entry type" punch f e' $ fromIntegral (n - e') if (IS.null os'') then punch f fileStart $ fromIntegral (h' - fileStart) else compact' f (CS h' os'')
To conclude, some related “real-world software engineering” notes:
- Since only blocks can be unmapped, calls to “punch” should be block-aligned, otherwise overflow is filled with zeroes, which results in completely useless and unnecessary random write operations
- One might want to use a threshold to punch only if the hole would be big enough (e.g. at least 10 consecutive blocks): creating lots of small holes can introduce unwanted fragmentation (this depends on the usage pattern of your database)
- It’s possible to be “smart” and only punch holes at regions which are still backed by physical blocks (the algorithm doesn’t, and shouldn’t, take any existing holes into account!). There are 2 ways to request information about existing holes to the file system: using ioctl(2) with the FS_IOC_FIEMAP option, if the file system supports this, or using lseek(2) with SEEK_HOLE and SEEK_DATA (which also requires filesystem support to work correctly). No testing has been performed to check out whether this provides any benefits yet.