Rediscovering the RSync Algorithm
Posted: February 14, 2012 Filed under: algorithms, OCaml, Programming, Uncategorized | Tags: algorithm, OCaml, optimization, Programming 10 Comments »A:Ok, you’re synchronizing this over the web;
and what do you use for the synchronization?
B: Oh, we implemented the rsync algorithm.
A: uhu. And what do you do with really big files?
B: The same.
A: And you also synchronise folders?
B: Yes.
A: And how do you do that?
B: we iterate over the folder, using the algorithm on every file, recursing over subfolders.
A: Can you try 2 things for me? First, a very large file; and second, a large codebase, and see if it holds.
Introduction
First of all, I am an admirer of the (original) rsync algorithm. (I think) it was a first stab at file synchronization, and it was so good people still implement it today.
But if you don’t understand its strenghts and weaknesses you might be in for a disappointment.
The Problem
You have 2 files, A’ and A that are different but alike. They reside on 2 different locations connected through a network. You want to make A’ identical to A.
The simplest solution is to just copy A, but given the similarity between the two files, there has to be a better way.
Historical Interlude
Networks used to be notoriously bad in the early 90s. Everybody who was transferring archives over large distances instinctively knew about a critical download size.
If the archive was too large, it would take too long, yielding a 100% chance something would go wrong somewhere resulting in an interrupted download. Even if the (up- or) download succeeded, chances were a small part of the file got corrupted, and you had to start over. The two first alleviations to this problem were checksums to detect accidental corruptions, and resumptions (being able to start a download at a certain offset).
RSync took care of interrupted downloads, and also provided a better solution when your file was corrupt. On top of that, it allowed low cost propagation of small changes, opening up a whole new range of applications. System administrators had a shiny new hammer.
The RSync Strategy
RSync just does a single round trip. First it creates a signature of A’, sends it over. On the other location it scans the local file, tries to find parts that are in the signature, while constructing a recipe as a stream of instructions. It’s possible to derive the algorithm starting from a primitive version, improving it step by step.
Since it’s fun too, I’ll be doing that here. While we’re playing, we’ll abstract away from IO, because it clouds the algorithmical view.
Mark 0
Let’s attack this in pure caveman style. Making a signature is splitting the file in blocks of equal size (except maybe the last). Iterating over the blocks, you calculate a digest and accumulate digests and block identifiers. Block identifiers are just their number: the first block has id 0, the second block id 1 aso.
let file_signature f b_size =
let f_len = String.length f in
let rec loop acc s i =
if s = f_len
then acc
else
let b_len = min b_size (f_len - s) in
let b = String.sub f s b_len in
let b_sig = block_signature b in
let acc' = (b_sig,i) :: acc in
loop acc' (s+b_len) (i+1)
in
loop [] 0 0
We have lots of choice to calculate a block signature. Let’s be lazy and pick Digest.string which is the md5 checksum of the block.
let block_signature block = Digest.string block
To recreate the file you need to interprete the stream of instructions. But what would these instructions be?
Well, in this version, you can be told to copy over a block or write a char.
type instruction = | C of char | B of int
Ok, how do you combine the signature together with the new file to generate a stream of instructions?
First thing that comes to mind is to scan over the new file, starting at position s
- consider the block starting at s and try to find it in the signature.
- if you find it, add a B j instruction, and jump a block forward.
- if you miss it, add a C c instruction, and step forward 1 position.
Let’s do that:
let updates f0_sig b_size f1 =
let f1_len = String.length f1 in
let rec loop acc s =
if s = f1_len
then List.rev acc
else
let b_len = min b_size (f1_len - s) in
let block = String.sub f1 s b_len in
let u,step =
try
let d = block_signature block in
let i = List.assoc d f0_sig in
(B i), b_len
with Not_found -> (C block.[0]), 1
in
let acc' = u :: acc in
loop acc' (s + step)
in
loop [] 0
The last step in our synchronisation scheme is to create a file using the old file,
together with the stream of instructions.
let apply old b_size ins =
let old_len = String.length old in
let buf = Buffer.create old_len in
let add_block i =
let s = i * b_size in
let b_len = min b_size (old_len - s) in
let block = String.sub old s b_len in
Buffer.add_string buf block
in
let add_char c = Buffer.add_char buf c in
let () = List.iter (function
| B i -> add_block i
| C c -> add_char c)
ins
in
Buffer.contents buf
So it took 60 lines of code to have a first stab at a synchronisation algorithm.
Let’s try this on an example:
let bs = 5
let remote = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
let mine = "aaaaabXbbbcccccddddde012"
let mine_s = file_signature mine bs
let delta = updates mine_s bs remote
let mine2 = apply mine bs delta;;
val bs : int = 5
val remote : string = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
val mine : string = "aaaaabXbbbcccccddddde012"
val mine_s : (Digest.t * int) list =
[("$\240\t\221\19200\222\199\2035\190|\222~#\n", 4);
("P\248M\175:m\253j\159 \201\248\239B\137B", 3);
("g\199b'k\206\208\158\228\22314\2137\209d\234", 2);
("\196\148\"\21926Lm\179V E=\201O\183,", 1);
("YO\128;8\nA9n\214=\2029P5B", 0)]
val delta : instruction list =
[B 0; C 'b'; C 'b'; C 'b'; C 'b'; C 'b'; B 2; B 3; C 'e'; C 'e'; C 'e';
C 'e'; C 'e'; C 'f'; C 'f'; C 'f'; C 'f'; C 'f'; C 'g'; C 'g'; C 'g';
C 'g'; C 'g'; C 'h'; C 'h'; C 'h'; C 'h'; C 'h'; C 'i'; C 'i'; C 'i';
C 'i'; C 'i'; C 'j'; C 'j'; C 'j'; C 'j'; C 'j'; C 'k'; C 'k'; C 'k']
val mine2 : string = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
Ok, it works, but how good is it?
Before I can answer this, first a note on block size. There are some ‘forces’ to be reckoned with
- the blocksize needs to be big compared to the block signature.
- if the blocksize is big, you will find less matches between the signature and the new file, so you need send more data back
- if the blocksize is small, you have lots of them, meaning your signature will be bigger
and you need an efficient lookup
The best blocksize will be depend not only on the file size, but on the actual changes.
How important is it really?
Well, let’s sync 2 images:
![]() |
![]() |
These 2 images are bitmaps of 5.5 MB each (stored as .bmp).
(I actually uploaded smaller versions as it seems useless to let your browser download more than 10MB for just 2 silly image samples)
I’ll sync’em using different blocksizes and see what gives.
let main () =
let bs = int_of_string (Sys.argv.(1)) in
let () = Printf.printf "bs=%i\n" bs in
let remote = get_data "new_image.bmp" in
let () = Printf.printf "remote: size=%i%!\n" (String.length remote) in
let mine = get_data "old_image.bmp" in
let mine_s = X.file_signature mine bs in
let () = Printf.printf "mine_s: len=%i%!\n" (Array.length mine_s) in
let delta = X.updates mine_s bs remote in
let (nbs,ncs) = List.fold_left (fun(bs,cs) i ->
match i with
| X.B _ -> (bs+1,cs)
| X.C _ -> (bs,cs+1)
) (0,0) delta
in
let mine2 = X.apply mine bs delta in
let () = Printf.printf "mine2: size=%i%!\n" (String.length mine2) in
let () = Printf.printf "bs=%i;cs=%i\n" nbs ncs in
| block size | # block signatures | blocks | chars | time |
| 8192 | 704 | 535 | 1384448 | 71s |
| 4096 | 1407 | 1084 | 1323008 | 92s |
| 2048 | 2813 | 2344 | 960512 | 92s |
| 1024 | 5626 | 4995 | 646144 | 115s |
| 512 | 11251 | 10309 | 482304 | 172s |
| 256 | 22501 | 20972 | 391424 | 283s |
| 128 | 45001 | 42508 | 319104 | 537s |
The first column is the block size. The second is the number of blocks in the file, the third is the number of B instructions and the fourth is the number of C instructions.
The last columns is measured execution time on my laptop. Processing time is the biggest issue. Ocaml is fast, but not fast enough to compensate for my brutal inefficiency. Imagine what it would do to a 4GB movie.
Mark 1
The problem is quickly revealed: List.assoc is not suited for long lists.
A better solution is use an array, sort it on block signature, and use binary search
(using a hashtable would be viable too).
let block_signature block = Digest.string block
let file_signature f b_size =
let f_len = String.length f in
let s = ref 0 in
let n_blocks = (f_len + b_size -1) / b_size in
Array.init n_blocks
(fun i ->
let start = !s in
let b_len = if start + b_size > f_len then f_len - start else b_size in
let b = String.sub f start b_len in
let b_sig = block_signature b in
let () = s := start + b_len in
(b_sig,i)
)
type instruction =
| C of char
| B of int
let updates f0_sig b_size f1 =
let my_cmp (s0,i0) (s1,i1) = String.compare s0 s1 in
let () = Array.sort my_cmp f0_sig in
let len = Array.length f0_sig in
let rec lookup b=
let rec find min max =
let mid = (min + max) / 2 in
let (ms,mi) = f0_sig.(mid) in
if ms = b
then mi
else if min > max then raise Not_found
else if b > ms then find (mid+1) max
else find min (mid -1)
in
find 0 (len -1)
in
let f1_len = String.length f1 in
let rec loop acc s =
if s = f1_len
then List.rev acc
else
let b_len = min b_size (f1_len - s) in
let block = String.sub f1 s b_len in
let u,step =
try
let d = block_signature block in
let i = lookup d in
(B i), b_len
with Not_found -> (C block.[0]), 1
in
let acc' = u :: acc in
loop acc' (s + step)
in
loop [] 0
let apply old b_size ins =
let old_len = String.length old in
let buf = Buffer.create old_len in
let add_block i =
let s = i * b_size in
let b_len = min b_size (old_len - s) in
let block = String.sub old s b_len in
Buffer.add_string buf block
in
let add_char c = Buffer.add_char buf c in
let () = List.iter (function
| B i -> add_block i
| C c -> add_char c)
ins
in
| block size | # block signatures | blocks | chars | time(s) |
| 8192 | 704 | 535 | 1384448 | 41 |
| 4096 | 1407 | 1084 | 1323008 | 20 |
| 2048 | 2813 | 2344 | 960512 | 7.8 |
| 1024 | 5626 | 4995 | 646144 | 1.9 |
| 512 | 11251 | 10309 | 482304 | 1.1 |
| 256 | 22501 | 20972 | 391424 | 0.8 |
| 128 | 45001 | 42508 | 319104 | 0.9 |
Wow, this is quite unexpected (but we’re not complaining). So what happened? Well, the lookup was so dominating, it was cloaking all other behaviour.
Now, with the lookup out of the way, other things can be observed. The problem now is that a ‘miss’ not only causes a C instruction to be emitted, but more importantly, it causes another digest to be calculated. The less blocks are found, the higher the processing time.
Mark 2
The cost of the miss was solved by Andrew Tridgell by introducing a second, weaker hash function.
He used the Adler-32 checksum which is a rolling checksum. ‘Rolling’ means that
adler32(buffer[a+1:b+1])= cheap(adler32(buffer[a:b]), which makes it suitable for our problem here. The ocaml standard library does not have an adler32 module, so I hacked one up.
It’s a naieve implementation in that it follows the definition closely. In fact, most of the modulo operations can be avoided by doing some extra administration.
I didn’t bother.
module Adler32 = struct
type t = {mutable a: int; mutable b: int; mutable c: int}
let padler = 65521
let make () = {a = 1 ;b = 0; c = 0}
let from buf offset length =
let too_far = offset + length in
let rec loop a b i=
if i = too_far
then {a; b; c = length}
else (* modulo can be lifted with a bit of math *)
let a' = (a + Char.code(buf.[i])) mod padler in
let b' = (b + a') mod padler in
loop a' b' (i+1)
in
loop 1 0 offset
let reset t = t.a <- 1;t.b <- 0; t.c <- 0
let digest t = (t.b lsl 16) lor t.a
let out t c1 =
let x1 = Char.code c1 in
t.a <- (t.a - x1) mod padler;
t.b <- (t.b - t.c * x1 -1) mod padler;
t.c <- t.c - 1
let rotate t c1 cn =
let up x = if x >= 0 then x else x + padler in
let x1 = Char.code c1 in
let xn = Char.code cn in
t.a <- up ((t.a - x1 + xn) mod padler);
t.b <- let f = (t.c * x1) mod padler in
let r = (t.b - f + t.a -1) mod padler in
up r
let update t buf offset length =
let too_far = offset + length in
let rec loop i =
if i = too_far then ()
else
let x = Char.code buf.[i] in
let () = t.a <- (t.a + x) mod padler in
let () = t.b <- (t.b + t.a) mod padler in
let () = t.c <- t.c + 1 in
loop (i +1)
in
loop offset
let string block =
let t = from block 0 (String.length block) in
digest t
end
Adler32 is much weaker than md5 and using it exposes you to collisions. So in fact, it acts as a cheap test that might yield false positives. That’s the reason the rsync algo keeps both: if the adler32 of the buffer is found in the signature, there is a second check to see if the md5s match. The fact one sometimes needs to rotate the checksum and at other times needs to reinitialize if from a part of the buffer, complicates the calculation of the updates a bit.
The code is shown below.
let updates f0_sig b_size f1 =
let my_cmp (wh0,sh0,i0) (wh1, sh1,i1) = compare wh0 wh1 in
let () = Array.sort my_cmp f0_sig in
let len = Array.length f0_sig in
let rec lookup w =
let rec find min max =
let mid = (min + max) / 2 in
let (mw, ms,mi) = f0_sig.(mid) in
if mw = w
then (ms,mi)
else if min > max then raise Not_found
else if w > mw then find (mid+1) max
else find min (mid -1)
in
find 0 (len -1)
in
let f1_len = String.length f1 in
let weak = Adler32.from f1 0 b_size in
let rec loop acc b e =
if b = f1_len
then List.rev acc
else
let wh = Adler32.digest weak in
let step_1 () =
let bc = f1.[b] in
let a = C bc in
let b' = b + 1 in
if e +1 < f1_len
then
let e' = e + 1 in
let ec = f1.[e] in
let () = Adler32.rotate weak bc ec in
loop (a:: acc) b' e'
else
let e' = e in
let () = Adler32.out weak bc in
loop (a:: acc) b' e'
in
try
let (s0,i0) = lookup wh in
let sh = Digest.substring f1 b (e - b) in
if s0 = sh
then
let b' = e in
let e' = min f1_len (e + b_size) in
let a = B i0 in
let () = Adler32.reset weak in
let () = Adler32.update weak f1 b' (e' - b') in
loop (a :: acc) b' e'
else step_1 ()
with Not_found -> step_1 ()
in
loop [] 0 b_size
The code indeed is a bit messier as we have more things to control at the same time, but it’s still quite small. Let’s see how wel it performs:
| block size | # block signatures | blocks | chars | time(s) |
| 8192 | 704 | 535 | 1384448 | 0.9 |
| 4096 | 1407 | 1084 | 1332008 | 0.9 |
| 2048 | 2813 | 2344 | 960512 | 0.8 |
| 1024 | 5626 | 4995 | 646114 | 0.6 |
| 512 | 11251 | 10309 | 482304 | 0.6 |
| 256 | 22501 | 20401 | 537600 | 0.7 |
| 128 | 45001 | 42508 | 319104 | 0.7 |
This almost completely removes the cost of a miss; at least for things of this size. The choice of blocksize does however affect the amount of data you need to send over.
There is a lot of other things you can do here:
- Block Size Heuristic: something like O(sqrt(file_size))
- SuperInstructions: make instructions for consecutive Blocks, and consecutive Chars
- Emit function: don’t accumulate the updates, but emit them (using a callback function)
- Smaller signatures: you can consider to drop some bytes from the strong hash
- IO
- Compress update stream
- …
The last remaining problem is that some modifications are not handled well by the algorithm (for example 1 byte changed in each block).
Maybe you could try a better algorithm.
There are lot’s of them out there, and they are worth checking out. (Before you know it you’ll be dabling into merkle trees or set reconciliation )
Anyway, I already exceeded my quotum for this post, but I still want to say a little thing about synchronisation of folders
The Next Problem
You have 2 trees of files, A’ and A that are different but alike. They reside on 2 different locations connected through a network. You want to make A’ identical to A.
What Not To Do
Don’t walk the folder and ‘rsync’ each file you encounter.
A small calculation will show you how bad it really is.
Suppose you have 20000 files, each 1KB. Suppose 1 rsync costs you about 0.1s
(reading the file, sending over the signature, building the stream of updates, applying them).
This costs you about 2000s or more than half an hour. System administrators know better:
they would not hesitate: “tar the tree, sync the tars, and untar the synced tar”.
Suppose each of the actions takes 5s (overestimating) you’re still synced in 15s.
Or maybe you can try unison. It’s ocaml too, you know.
have fun,
Romain.



[...] This was a very interesting read: http://blog.incubaid.com/2012/02/14/rediscovering-the-rsync-algorithm/ [...]
It is an interesting read but I have trouble understanding the code. Is that in python? You may be able to get more audience if you use pseudo code, or maybe a C language? Not everyone do Perl, Python… just saying…
The code is OCaml (or Objective Caml), which is a functional programming language in the ML family.
It might take some time to get used to reading (or writing) it, but it’s certainly worth it… At Incubaid we develop several projects using OCaml, including Arakoon (a distributed, consistent key-value store) and Baardskeerder (an append-only B-tree-like database). We presented some of the reasons for this choice in a presentation available here.
Suggest the original rsync algorithm from Tridge’s PhD thesis. It’s extremely readable:
http://www.samba.org/~tridge/phd_thesis.pdf
Reblogged this on WordPress Trends.
First “man rsync” to really understand its power, it does work at 8K blocks of a file.
Using various combinations of switches you can optimize using rsync as fit for your needs.
Things that came prior to rsync
Kermit http://en.wikipedia.org/wiki/Kermit_(protocol)
X/Y/Zmodem http://en.wikipedia.org/wiki/XMODEM
rdist http://linux.about.com/library/cmd/blcmdl1_rdist.htm
You seem to be mixing algorithmic efficiency arguments with tool efficiency arguments.
rsync (the tool) is smart enough not to take half an hour to sync 2 trees of 20k files. Presumably it does this using metadata such as access/modification times — so it doesn’t even send the signatures at all.
I would venture to guess that in general practice, “tar the tree, sync the tars, and untar the synced tar” would be slower. A sysadmin would be unlikely to suggest this.
-Ben
Well, I probably lack writing skills,as I (apparently incorrectly) thought the context was clear. The whole post is a comment on the prologue conversation, in which a culprit just combines the rsync algorith with the strategy of walking the file system. RSync (the tool) is not so naive to do this neither. Anyway, remote synchronizing of trees is a balance exercise in which you want to minimize the number of roundtrips and maximize their effectiveness.
Btw, I’m not sure RSync (again, the tool) is optimal in doing this (but that’s another discussion entirely).
[...] original disponível em http://blog.incubaid.com/2012/02/14/rediscovering-the-rsync-algorithm/ [...]
It’s not like network transfers were unreliable before 1996, or that the rsync protocol offered a significant improvement in that particular area. What has improved is the application-level error handling, as well as, of course, with rsync, efficient differential transfer of partial files. Many trivial network transfer tools had poor handling of interrupted transfers, and yes, interruptions were certainly more common back then, but as witnessed by the large FTP mirror networks of the time, it was hardly impossible to transfer large files and large collections of files correctly.