Interludium: It’s Doomsday!
Posted: February 28, 2013 Filed under: algorithms | Tags: conway, doomsday, ocaml geek Leave a comment »And Now for Something Completely Different (Monty Python’s Flying Circus)
Introduction
This post explains how to calculate the day of the week, with all calculations inside your head.
It’s a geekish party trick, and doesn’t really belong on a research blog,
but since today is in fact doomsday, I feel I can get away with it.
Anyway, here follows the algorithm written down in ocaml, but optimized for my brain.
Doomsday
Doomsday is the most important day of the year. It’s the last day of February:
February 28th on normal years, February 29th on leap years.
If you know what day of the week it is on Doomsday, you quite easily adjust for any other day of the year.
The trick is to have an anchor in each month that is the same day of the week as Doomsday.
Anchors for each month
The Anchor for February is doomsday itself.
The anchor for another even month m is m. The anchor for most odd months can be remembered by
the following mnemonic:
I work 9-5 at the 7-11
It simply helps you remember that doomsday for the 9th month is the fifth day of that month and vice-versa.
This only leaves January and March. March is easy: the zeroth of March is the last day of February, which is doomsday. January is a bit more difficult: most years it’s January 3rd, on leap years it’s the 4th. The ocaml summary is this:
let anchor_of_month yy = function (* "I work 9-5 at the 7-11" *)
| 1 -> 3 + leap yy
| 2 -> 28 + leap yy
| 3 -> 0 (* last day of Feb is zeroth day of Mar *)
| 5 -> 9
| 9 -> 5
| 7 -> 11
| 11 -> 7
| m -> m
An example:
- 25/12/2013 Christmas
- Today (28/2/2013 = Doomsday 2013) is a Thursday
- Hence, December 12 (12/12) is also a Thursday.
- 19 Thu, 26 Thu, 25 is a Wednesday.
This is all you need to know to be able to calculate the day of the week for any day in the current year.
The funny thing is that this is rather simple to explain to humans, but not so trivial to formalize.
Here’s an attempt:
let day_of_week hh yy mm dd =
let doom = anchor_of_month yy mm
and w = of_year hh yy
in
if doom >= dd
then zoom_down doom w dd
else zoom_up doom w dd
The two zoom functions are what we do when we move from the anchor in the month to the day we want.
let rec zoom_down known w dd =
if known = dd then days.(w)
else
let t = known -7 in
if t > dd
then zoom_down t w dd
else zoom_down (known - 1) (prev w) dd
let rec zoom_up known w dd =
if known = dd then days.(w)
else
let t = known + 7 in
if t < dd
then zoom_up t w dd
else zoom_up (known + 1) (next w) dd
If this were code written for a computer, I would have distilled something more compact, but my brain doesn’t handle higher order functions very well.
Doomsday for other years
The most frequent use of the doomsday algorithm as a party trick is the calculation the day of the week of someone’s birthday. The algo will yield a number, which you will interprete as a day of the week.
Days are numbered like this:
| Sun | 0 |
| Mon | 1 |
| … | … |
The next thing you need to know is the adjustment for the century.
For 19YY years it’s 3 (WEDnesday) while for 20YY years, it’s 2 (Y-TUE-K).
There are rules for other centuries, but I didn’t learn them, as I don’t frequent vampire parties.
The formula I tend to use to calculate the doomsday of a year is shown below.
let of_year hh yy =
let start = match hh with
| 19 -> 3 (* WE'D be in this day *)
| 20 -> 2 (* Y-Tue-K *)
in
let d12 = yy / 12 in
let r12 = (yy mod 12) in
let r4 = r12 / 4 in
(start + d12 + r12 + r4) mod 7
For some years it’s simpler to move the mod 7 up a bit.
Some tests
If you want to practice this, you need some special dates. Below are some exercises.
29/02/2013 => Thu
25/05/2013 => Sat
25/12/2013 => Wed
01/01/2013 => Tue
11/09/2001 => Tue
06/10/1973 => Sat
03/11/1971 => Wed
20/07/1969 => Sun
08/05/1945 => Tue
11/11/1918 => Mon
Closing words
I picked this up on various resources. The good stuff comes from there, the mistakes are my own.
- Doomsday_rule page on wikipedia
- Rudy Limeback’s page on the doomsday algorithm
- Mind performance hacks (O’Reilly)
Have fun,
Romain.
User functions in Arakoon
Posted: February 1, 2013 Filed under: Arakoon, Baardskeerder, OCaml, Programming, Uncategorized | Tags: arakoon, baardskeerder, key value store, OCaml, user functions Leave a comment »Mahomet cald the Hill to come to him. And when the Hill stood still, he was neuer a whit abashed, but said;
If the Hill will not come to Mahomet, Mahomet wil go to the hill.
Francis Bacon
Introduction
Arakoon tries to be a simple distributed key value store that favours consistency over availability.
From time to time, we get feature requests for additional commands like:
- assert_exists: assert a value exists for the key without caring what the value actually is
- increment_counter: increment (or create) a counter and return the new value.
- queue operations : add an element to the front/back of a double ended queue or pop an element
- set_and_update_X: insert a key value pair and update some non-trivial counter X (think averages, variances,…)
- …
The list is semi-infinite and the common thing here is that they are too complex/specific/weird/… to do them in one step using the provided interface. Of course, you can do all of this on the client side, but it will cost extra network round-trips. In distributed systems, you really want to keep the number of round-trips low, which pushes you towards these kind of feature requests.
Once you decided (performance bottlenecks probably) that you need extra functionality there are two things you can do. First, you can try to force or entice us into adding them to the core interface or alternatively, you can get by using Arakoon’s “user functions”. For some reason people fear them but there’s no real technical reason to do so.
This blog post will cover two things. First we’ll go in to the nitty gritty of coding and deploying user functions and then we’ll look at some of the strategic/architectural challenges of user functions.
How do user functions work?
The high level view is this: you build a user function, and register it to an Arakoon cluster before you start it. Then, at runtime, you can call it, using any client, with a parameter (a string option) and get back a result (string option). On the server side, the master will log this in its transaction log, try to reach consensus with the slave(s) and once that is the case, the user function will be executed inside a transaction. The result of that call will be sent to the client. If an exception occurs, the transaction will be aborted. Since Arakoon logs transactions it can replay them in case of calamities. This has a very important impact: since Arakoon needs to be able to replay the execution of a user function, you cannot do side effects, use random values or read the system clock.
Running Example
We’re going to try to build a simple queue API.
It will offer named queues with 2 operations: push and pop. Also, it’s a first-in-first-out thingy.
Arakoon 1
Client side API
Arakoon 1 offers the following API for user functions.
def userFunction(self, name, argument): '''Call a user-defined function on the server @param name: Name of user function @type name: string @param argument: Optional function argument @type argument: string option @return: Function result @rtype: string option '''
Let’s take a look at it. A userFunction call needs the name, which is a string, and an argument which is a string option and returns a result of type string option. So what exactly is a string option in Python? Well, it’s either a string or None. This allows a user function to not take input or to not yield a result.
Server side API
The server side API is in OCaml, and looks like this:
class type user_db = object method set : string -> string -> unit method get : string -> string method delete: string -> unit method test_and_set: string -> string option -> string option -> string option method range_entries: string option -> bool -> string option -> bool -> int -> (string * string) list end
User functions on server side match the client’s opaque signature.
user_db -> string option -> string option
Queue’s client side
Let’s create the client side in python. We’ll create a class that uses an Arakoon client and acts as a queue. The problem with push is that we need to fit both the name and the value into the one paramater we have available. We need to do our own serialization. Let’s just be lazy (smart?) and use Arakoon’s serialization. The code is shown below.
from arakoon import Arakoon
from arakoon import ArakoonProtocol as P
class ArakoonQueue:
def __init__(self, name, client):
self._name = name
self._client = client
def push(self, value):
input = P._packString(self._name)
input += P._packString(value)
self._client.userFunction("QDemo.push", input)
def pop(self):
value = self._client.userFunction("QDemo.pop", self._name)
return value
That wasn’t too hard now was it?
Queue, server side
The whole idea is that the operations happen on server side, so this will be a tat more complex.
We need to model a queue using a key value store. Code-wise, that’s not too difficult.
For each queue, we’ll keep 2 counters that keep track of both ends of the queue.
A push is merely getting the qname and the value out of the input, calculating the place where we need to store it, store the value there and update the counter for the back end of the queue. A pop is similar but when the queue becomes empty, we use the opportunity to reset the counters (maybe_reset_counters). The counter representation is a bit weird but Arakoon stores things in lexicographical order and we want to take advantage of this to keep our queue fifo. Hence, we need to make the counter in such a way the counter’s order is the same as a string’s order. The code is shown below.
(* file: plugin_qdemo.ml *)
open Registry
let zero = ""
let begin_name qname = qname ^ "/@begin"
let end_name qname = qname ^ "/@end"
let qprefix qname key = qname ^ "/" ^ key
let next_counter = function
| "" -> "A"
| s ->
begin
let length = String.length s in
let last = length - 1 in
let c = s.[last] in
if c = 'H'
then s ^ "A"
else let () = s.[last] <- Char.chr(Char.code c + 1) in
s
end
let log x=
let k s = let s' = "[plugin_qdemo]:" ^ s in
Lwt.ignore_result (Lwt_log.debug s')
in
Printf.ksprintf k x
let maybe_reset_counters user_db qname b1 =
let e_key = end_name qname in
let exists =
try let _ = user_db # get e_key in true with Not_found -> false
in
if exists
then
let ev = user_db # get e_key in
if ev = b1 then
let b_key = begin_name qname in
let () = user_db # set b_key zero in
let () = user_db # set e_key zero in
()
else
()
else ()
let push user_db vo =
match vo with
| None -> invalid_arg "push None"
| Some v ->
let qname, p1 = Llio.string_from v 0 in
let value, _ = Llio.string_from v p1 in
let e_key = end_name qname in
let b0 =
try user_db # get (end_name qname)
with Not_found -> zero
in
let b1 = next_counter b0 in
let () = user_db # set (qprefix qname b1) value in
let () = user_db # set e_key b1 in
None
let pop user_db vo =
match vo with
| None -> invalid_arg "pop None"
| Some qname ->
let b_key = begin_name qname in
let b0 =
try user_db # get (begin_name qname)
with Not_found -> zero
in
let b1 = next_counter b0 in
try
let k = qprefix qname b1 in
let v = user_db # get k in
let () = user_db # set b_key b1 in
let () = user_db # delete k in
let () = maybe_reset_counters user_db qname b1 in
Some v
with
Not_found ->
let e_key = end_name qname in
let () = user_db # set b_key zero in
let () = user_db # set e_key zero in
None
let () = Registry.register "QDemo.push" push
let () = Registry.register "QDemo.pop" pop
The last two lines register the functions to the Arakoon cluster when the module is loaded.
Compilation
So how do you deploy your user function module into an Arakoon cluster?
First need to compile your module into something that can be dynamically loaded.
To compile the plugin_qdemo.ml I persuade ocamlbuild like this:
ocamlbuild -use-ocamlfind -tag 'package(arakoon_client)' \ -cflag -thread -lflag -thread \ plugin_qdemo.cmxs
It’s not too difficult to write your own testcase for your functionality, so you can run it outside of Arakoon and concentrate on getting the code right.
Deployment
First, you need put your compilation unit into the Arakoon home directory on all your nodes of the cluster. And second, you need to add the name to the global section of your cluster configuration. Below, I show the configuration file for my simple, single node cluster called ricky.
[global] cluster = arakoon_0 cluster_id = ricky ### THIS REGISTERS THE USER FUNCTION: plugins = plugin_qdemo [arakoon_0] ip = 127.0.0.1 client_port = 4000 messaging_port = 4010 home = /tmp/arakoon/arakoon_0
All right, that’s it. Just a big warning about user functions here.
Once a user function is installed, it needs to remain available, with the same functionality for as long as user function calls are stored inside the transaction logs, as they need to be re-evaluated when one replays a transaction log to a store (for example when a node crashed, leaving a corrupt database behind). It’s not a bad idea to include a version in the name of a user function to cater for evolution.
Demo
Let’s use it in a simple python script.
def make_client():
clusterId = 'ricky'
config = Arakoon.ArakoonClientConfig(clusterId,
{"arakoon_0":("127.0.0.1", 4000)})
client = Arakoon.ArakoonClient(config)
return client
if __name__ == '__main__':
client = make_client()
q = ArakoonQueue("qdemo", client)
q.push("bla bla bla")
q.push("some more bla")
q.push("3")
q.push("4")
q.push("5")
print q.pop()
print q.pop()
print q.pop()
print q.pop()
with expected results.
Arakoon 2
With Arakoon 2 we moved to Baardskeerder as a database backend, replacing the combination of transaction logs and Tokyo Cabinet. Since the backend is Lwt-aware, this means that the server side API has become too:
module UserDB : sig type tx = Core.BS.tx type k = string type v = string val set : tx -> k -> v -> unit Lwt.t val get : tx -> k -> (v, k) Baardskeerder.result Lwt.t val delete : tx -> k -> (unit, Baardskeerder.k) Baardskeerder.result Lwt.t end module Registry: sig type f = UserDB.tx -> string option -> (string option) Lwt.t val register: string -> f -> unit val lookup: string -> f end
The major changes are that
- the api now uses Lwt
- we have (‘a,’b) Baardskeerder.result types, which we favour over the use of exceptions for normal cases.
Rewriting the queue implementation to Arakoon 2 yields something like:
(* file: plugin_qdemo2.ml *)
open Userdb
open Lwt
open Baardskeerder
let zero = ""
let begin_name qname = qname ^ "/@begin"
let end_name qname = qname ^ "/@end"
let qprefix qname key = qname ^ "/" ^ key
let next_counter = function
| "" -> "A"
| s ->
begin
let length = String.length s in
let last = length - 1 in
let c = s.[last] in
if c = 'H'
then s ^ "A"
else let () = s.[last] <- Char.chr(Char.code c + 1) in
s
end
let reset_counters tx qname =
let b_key = begin_name qname in
let e_key = end_name qname in
UserDB.set tx b_key zero >>= fun () ->
UserDB.set tx e_key zero
let maybe_reset_counters tx qname (b1:string) =
let e_key = end_name qname in
begin
UserDB.get tx e_key >>= function
| OK _ -> Lwt.return true
| NOK _ -> Lwt.return false
end >>= function
| true ->
begin
UserDB.get tx e_key >>= function
| OK ev ->
if ev = b1
then reset_counters tx qname
else Lwt.return ()
| NOK _ -> Lwt.return ()
end
| false -> Lwt.return ()
let push tx vo =
match vo with
| None -> Lwt.fail (invalid_arg "push None")
| Some v ->
let qname, p1 = Llio.string_from v 0 in
let value, _ = Llio.string_from v p1 in
Lwt_log.debug_f "push:qname=%S;value=%S" qname value >>= fun ()->
let e_key = end_name qname in
UserDB.get tx (end_name qname) >>= fun b0r ->
let b0 = match b0r with
| OK b0 -> b0
| _ -> zero
in
let b1 = next_counter b0 in
UserDB.set tx (qprefix qname b1) value >>= fun () ->
UserDB.set tx e_key b1 >>= fun () ->
Lwt.return None
let pop tx = function
| None -> Lwt.fail (invalid_arg "pop None")
| Some qname ->
begin
let b_key = begin_name qname in
UserDB.get tx (begin_name qname) >>= fun b0r ->
begin
match b0r with
| OK b0 -> Lwt.return b0
| NOK _ -> Lwt.return zero
end
>>= fun b0 ->
let b1 = next_counter b0 in
let k = qprefix qname b1 in
UserDB.get tx k >>= fun vr ->
begin
match vr with
| OK value ->
begin
UserDB.set tx b_key b1 >>= fun () ->
UserDB.delete tx k >>= function
| OK () ->
begin
maybe_reset_counters tx qname b1 >>= fun () ->
Lwt.return (Some value)
end
| NOK e -> Lwt.fail (Failure e)
end
| NOK _ ->
reset_counters tx qname >>= fun () ->
Lwt.return None
end
end
let () = Userdb.Registry.register "QDemo.push" push
let () = Userdb.Registry.register "QDemo.pop" pop
Both client side and deployment remain the same.
Questions asked
Ain’t there something wrong with this Queue?
Yes! Glad you noticed. This queue concept is fundamentally broken. The problem is the pop.
Follow this scenario:
- the client calls the QDemo.pop function
- the cluster pops the value from the queue and its master sends it to the client.
- the client dies before it can read the popped value
Now what? We’ve lost that value. Bloody network, how dare you!
Ok, I admit this was naughty, but it’s a good example of a simple local concept that doesn’t really amount to the same thing when tried in a distributed context. When confronted with this hole, people immediately try to fix this with “Right!, so we need an extra call to …”. To which I note: “But wasn’t this extra call just the thing you were trying to avoid in the first place?”
Why don’t you allow user functions to be written in <INSERT YOUR FAVOURITE LANGUAGE HERE>?
This is a good question, and there are several answers, most of them wrong. For example, anything along the lines of “I don’t like your stinkin’ language” needs to be rejected because a language’s cuteness is irrelevant.
There are several difficulties with the idea of offering user functions to be written in another programming language. For scripting languages like Python, Lua, PHP ,… we can either implement our own interpreter and offer a subset of the language, which is a lot of work with low return on investment, or integrate an existing interpreter/runtime which will probably not play nice with Lwt, or with the OCaml runtime (garbage collector). For compiled languages we might go via the ffi but it’s still way more complex for us. So for now you’re stuck with OCaml for user functions. There are worse languages.
Wouldn’t it be better if you apply the result of the user function to the transaction log iso the arguments?
Well, we’ve been thinking about that a lot before we started with user functions. The alternative is that we record and log the effect of the user function so that we can always replay that effect later, even when the code is no longer available. It’s an intriguing alternative, but it’s not a clear improvement. It all depends on the size of the arguments versus the size of the effect.
Some user functions have a small argument set and a big effect, while for other user functions it’s the other way around.
Closing words
Technically, it’s not too difficult to hook in your own functionality into Arakoon. Just make sure the thing you want to hook in does not have major flaws.
have fun,
Romain.
Caulking your distributed algorithm implementation
Posted: October 25, 2012 Filed under: algorithms, OCaml | Tags: arakoon, distributed systems, graphviz, OCaml 4 Comments »Me: Ok, all unit tests succeed.
All system tests succeed.
All acceptance tests succeed.
Release!
(some time later)
NN: We have a problem with your distributed database.
Me: Ok, what’s the matter?
NN: Once in a while the cluster seems to get stuck.
Me: Stuck how?
NN: It seems to be unable to elect a master.
Me: What did you do to get there?
NN: We don’t know, but we have the logs. Here they are. Happy hunting.
Me: $#%#&*!
(some time later)
Me: Aha, that’s what went wrong!
Introduction
The Arakoon team has spent a big part of the last two years on variations of this theme. Some of the problems turned out to configuration problems (sometimes even of other clusters), but there definitely were situations where the issue was a caused by an implementation error. As it turns out, they all have the same cause: not all possible scenarios were covered. So, is it possible to escape this vicious circle?
We already learned some things (see this previous blog post), so we use asynchronuous message passing and removed all IO from our implementation. So what’s left?
Well we have a set of distributed objects with well defined states and a set of messages that can be exchanged between them. Can’t we just generate all possible states of the distributed system and all possible messages and check how the relationships between them? This is non-trivial as the number of possible states is infinite and so is the number of possible messages. On the other hand, most state combinations are bad and inconsistent, and most messages are irrelevant… There might be something we can do.
Growing a state space-ish diagram
What if we start from a consistent starting state for all objects (let’s call this the system state), generate all relevant messages that can be exchanged from that state and apply them in all possible orders. The system states that can be reached in this way from the starting state should all be consistent. If we find a state that’s not consistent, we stop. For consistent system states, we can iterate. What about inconsistent states? Well, clearly this means our algorithm is capable of doing bad things. We should check the scenario that produced this and fix it, and iterate again. Is this doable? Well, maybe… and what about simulating dropped messages?
Humble beginnings
Let’s start small. What about growing a Basic Paxos implementation for a system of three nodes? Modelling the messages should not be too difficult:
type 'v kind =
| Prepare
| Promise of 'v option
| Accept of 'v
| Accepted
type 'v message = {n:n; s:id; t:id; k:'v kind}
Each message is associated with a paxos round n, has a source s and a target t and has semantics described by its kind k. Finally there’s some value type (‘v) for the things the system should try to find consensus on. (You can find the code on Github)
Modelling the agents is a bit more work:
type 'v agent_state =
| SHalted
| SIdle
| SRequest of ('v proposal)
| SPromised of 'v option
| SLead of ('v * int) (* value, outstanding acks *)
type 'v agent = {
id : id;
pop : id list;
_n : n;
state : 'v agent_state;
store : (n * 'v) list;
}
An agent has an id, knows the other agents (pop), has a store and a current paxos round n. The interesting part is the inner state representing the role it’s playing. It can be halted or idle, requesting to become a leader for a proposal or leading an update.
Now we have messages and agents, we can model the state of the system.
type 'v network = 'v message list
type 'v agents = 'v agent list
type 'v state = { net: 'v network; ags: 'v agents}
The system state is merely a collection of agents (ags) and a network (net) representing the messages that can be delivered to the agents.
How can the state of the system change? First of all, the delivery of a message most likely will have an impact. We might add other moves later.
type 'v move = | DeliverMsg of 'v message
Generating all moves is now easy: for every message in the network, there is a move that delivers it, and since we want to stop in bad states, we don’t generate messages there:
let generate_moves state =
if is_bad state
then []
else
let deliver = List.map (fun x -> DeliverMsg x) state.net in
...
How about executing a move? There is some administration to do there.
We have a move to execute, a current state, a path that was followed to arrive at that state, and a set of observed states. If the move is the delivery of a message, we find the target agent and let him handle the message. This will change the agent’s state and produce some messages (extra).
let execute_move move state path observed =
let deliver m ag =
let agent = find_agent ag m.t in
let agent', extra = A.handle_message agent m in
let ag' = replace_agent agent' ag in
ag', extra
in
let state' =
match move with
| DeliverMsg m ->
let net' = remove m state.net in
let ags',extra = deliver m state.ags in
{ net = net' @ extra; ags = ags'}
Executing a move will cause a new system state. We will record observing the transition from the old state to the new by this move, and create the new path.
let transition = (state_label state, move, state_label state') in
TSet.add transition observed in
state' , (move,state) :: path , observed'
The whole simulator is a few functions away:
let rec check_all level state path observed =
if not (is_consistent state) then raise (Bad (path,state));
if level = limit
then observed
else
let rec loop observed = function
| [] -> observed
| move :: rest ->
let state', path', observed' = execute_move move state path observed in
let observed'' = check_all (level + 1) state' path' observed' in
loop observed'' rest
in
let moves = generate_moves state in
loop observed moves
let run state0 = check_all 0 state0 [] TSet.empty
Basically we start from an initial state, go down the tree of possible moves, execute all these and accumulate observed transitions.
Generating a diagram is trivial with Graphviz. Just iterate over the observed transitions. (not shown here, see on Github for details )
The simulation
We create 3 agents, let the first one start on a value, run our simulator from the starting state, and dottify the observed transitions.
let main () =
let ids = List.map id [1;2;3] in
let a1,a1_out = start (make_agent (Id 1) ids) "x" in
let a2,a2_out = (make_agent (Id 2) ids),[] in
let a3,a3_out = (make_agent (Id 3) ids),[] in
let world = [a1;a2;a3] in
let state0 = {net = a1_out @ a2_out @ a3_out; ags = world} in
try
let observed = M.run state0 in
M.dottify state0 observed
with (M.Bad (path, state)) ->
Printf.eprintf "bad path:\n";
M.dump_path path;
Printf.eprintf "%s\n" (state_label state)
Mark0
Let’s try this on a brain-dead simple implementation of an agent.
One that goes to the halted state as soon as it receives a message, while sending out no message at all.
module Mark0 = (struct let handle_message agent message = halt agent, [] end : ALGO)
What do we see here? First, there is a labelling scheme for states: R1I0I0 means the first agent has n=1 and is in a Requesting state, while the second and third agents ar in Idle state with n=0.
After the delivery of the {Prepare;1->2;n=1} message, a prepare from agent 1 to agent 2, the second agent halts. Likewise for the other prepare message. This looks ok, so let’s move on.
Mark1
Let’s build an agent implementation that covers the happy path.
module Mark1 = (struct
let prepare_when_idle source n agent=
let an = agent._n in
if n > an
then
let pv = None in
let agent' = {agent with state = SPromised pv; _n = n;} in
let msg = {n = n;s = agent.id;t = source; k = Promise pv } in
let out = [msg] in
agent', out
else
halt agent,[]
let promise_for_request (source:id) (mn:n) vo (proposal:'v proposal) agent =
let pv,pballot = proposal in
if mn = agent._n
then
let pballot' = Ballot.register_vote pballot source vo in
if Ballot.have_majority pballot'
then
let value = Ballot.pick_value pballot' in
let outstanding_acks = Ballot.quorum pballot' -1 in
let me = agent.id in
let targets = others agent in
let make_msg t = {n = mn; s = me; t ; k = Accept value} in
let broadcast = List.map make_msg targets in
let agent' = { agent with
store = (mn,value) :: agent.store;
state = SLead (value, outstanding_acks);
}
in
agent', broadcast
else
agent, []
else
halt agent, []
let handle_accept m v agent =
match agent.state with
| SPromised vo when agent._n = m.n ->
let agent' = {agent with state = SIdle; store = (m.n, v) :: agent.store;} in
let out = [{n = m.n;s = agent.id;t = m.s;k = Accepted}] in
agent', out
| _ -> halt agent, []
let handle_accepted m agent =
match agent.state with
| SLead (v,out) when agent._n = m.n ->
let out' = out -1 in
let state' = if out' = 0 then SIdle else SLead(v,out') in
{agent with state = state'},[]
| _ -> halt agent, []
let handle_message agent m =
match m.k with
| Prepare when agent.state = SIdle -> prepare_when_idle m.s m.n agent
| Promise vo ->
begin
match agent.state with
| SRequest p -> promise_for_request m.s m.n vo p agent
| _ -> halt agent, []
end
| Accept v -> handle_accept m v agent
| Accepted -> handle_accepted m agent
| _ -> halt agent,[]
end : ALGO)
What does this do? (click on it to see the full picture)

The good news is that there are quite a number of paths from the initial state I0I0I0 that reach our happy state I1I1I1, but there are also a lot of scenarios that end up in bad states.
Let’s look at one in detail.
R1I0I0:{Prepare;1->3;n=1} --->
R1I0P1:{Promise;3->1;n=1} --->
L1I0P1:{Accept; 1->2;n=1} --->
L1H0P1
What happened here? A Prepare message goes from agent 1 to agent 3. That agent sends a Promise back.
This causes agent 1 to become a leader and broadcast Accept messages. One of these reaches agent 1, which is clueless as it did not receive a Prepare message first. Agent 1 therefore halts.
The diagram allows us to understand scenarios that lead to bad states, and to modify the algorithm accordingly. This process of finding holes in your algorithm,patching them and iterating is something which I call caulking in absence of a better word. In this particular case, an agent that is Idle can receive an Accept for the next n and should be able to move to the Idle state at the next n.
What about dropped messages?
Earlier, I did not answer the question about the simulation of dropped messages. The above scenario should make clear that we are actually, in luck. There is no difference between that scenario and a scenario where a Prepare from agent 1 and agent 2 was dropped. In general, there is no difference between dropping a message and delaying it until it is no longer relevant. This means there is no need for us to simulate them at all!
Mark2
Let’s caulk Mark1. Looking at the diagram, not a lot of things need to be fixed. Here’s a list of messages that go awry.
- Accept;n when agent is Idle at pred n
- Accepted;n when agent is already Idle at n
- Promise;n when agent is already Leading at n
- Promise;n when agent is already Idle at n
Ok, adapting the code is easy:
module Mark2 = (struct
let prepare_when_idle source n agent=
let an = agent._n in
if n > an
then
let pv = None in
let agent' = {agent with state = SPromised pv; _n = n;} in
let msg = {n = n;s = agent.id;t = source; k = Promise pv } in
let out = [msg] in
agent', out
else
halt agent,[]
let promise_for_request (source:id) (mn:n) vo (proposal:'v proposal) agent =
let pv,pballot = proposal in
if mn = agent._n
then
let pballot' = Ballot.register_vote pballot source vo in
if Ballot.have_majority pballot'
then
let value = Ballot.pick_value pballot' in
let outstanding_acks = Ballot.quorum pballot' -1 in
let me = agent.id in
let targets = others agent in
let make_msg t = {n = mn; s = me; t ; k = Accept value} in
let broadcast = List.map make_msg targets in
let agent' = { agent with
store = (mn,value) :: agent.store;
state = SLead (value, outstanding_acks);
}
in
agent', broadcast
else
agent, []
else
halt agent, []
let handle_accept m v agent =
let _accept m =
let agent' = {agent with state = SIdle; store = (m.n, v) :: agent.store;} in
let out = [{n = m.n;s = agent.id;t = m.s;k = Accepted}] in
agent', out
in
match agent.state with
| SPromised vo when agent._n = m.n -> _accept m
| SIdle when (next agent._n) = m.n -> _accept m
| _ -> halt agent, []
let handle_accepted m agent =
match agent.state with
| SLead (v,out) when agent._n = m.n ->
let out' = out -1 in
let state' = if out' = 0 then SIdle else SLead(v,out') in
{agent with state = state'},[]
| SIdle when agent._n = m.n -> agent,[]
| _ -> halt agent, []
let handle_message agent m =
match m.k with
| Prepare when agent.state = SIdle -> prepare_when_idle m.s m.n agent
| Promise vo ->
begin
match agent.state with
| SRequest p -> promise_for_request m.s m.n vo p agent
| SLead(v,out) when agent._n = m.n -> agent, []
| SIdle when agent._n = m.n -> agent, []
| _ -> halt agent, []
end
| Accept v -> handle_accept m v agent
| Accepted -> handle_accepted m agent
| _ -> halt agent,[]
end : ALGO)
Isn’t it nice that fixing the holes in our algorithm actually makes the diagram smaller? Since we don’t end up in bad states anymore, there are way less transitions. It’s also aesthetically pleasing graphviz shows all arrows from left to right, meaning there are no transitions that actually increase the distance between the current state and the state we’re aiming for.
What about agents that are wiped clean?
This kind of calamity is not too difficult to simulate. Basically it’s a move that puts the agent back in its starting state. Let’s add the possibility that one of the agents is wiped.
let generate_moves state =
if is_bad state
then []
else
let deliver = List.map (fun x -> DeliverMsg x) state.net in
let id3 = Id 3 in
let agent = find_agent state.ags id3 in
let wipe =
if is_halted agent
then []
else [Wipe id3]
in
deliver @ wipe
Let’s try that…
./paxos.byte > mark3.dot bad path: 1 ---(1: Prepare) ---> 3 3 ---(1: Promise) ---> 1 1 ---(1: Accept) ---> 2 1 ---(1: Accept) ---> 3 Wipe 3 L1I0I0
Auch. There actually is something wrong here. As it turns out, there is a bug in the Mark2 module.
It’s this fragment that’s wrong:
let handle_accept m v agent =
let _accept m =
let agent' = {agent with state = SIdle; store = (m.n, v) :: agent.store;} in
let out = [{n = m.n;s = agent.id;t = m.s;k = Accepted}] in
agent', out
in
match agent.state with
| SPromised vo when agent._n = m.n -> _accept m
| SIdle when (next agent._n) = m.n -> _accept m
| _ -> halt agent, []
Handling an Accept when the agent is in the Idle state should also set the n correctly (the one of the message). Let’s fix that and try again.
Here it is:

Again, we’re confronted with lots of bad states but we know how to fix that. What are the scenarios that bring us in bad states? As it turns out, these are all caused by old Prepare messages. Adding that and generating the diagram again yields:
Indeed, wiping an agent moves to a state somewhere to the left of the current state, which matches the idea of being further away from our goal.
Are we there yet?
So far, we only addressed cases where there is only 1 agent in a requesting state. So what would happen if there are two agents requesting something at the same time?
Happy caulking!
Closing Remarks
Arriving at a correct implementation of an algorithm is difficult, but even more so in the case of distributed systems. This blog post shows a strategy you can apply in caulking your own implementations. As stated before, making your implementation pure helps you a lot.
Have fun,
Romain.
“How hard can it be?” – on coding, chess and elo
Posted: February 24, 2012 Filed under: Programming | Tags: chess, coding, elo, Programming 15 Comments »Introduction
It happens to all programmers. After some preliminary work, you’re at the point where you can see the solution with your mind’s eye, and all that’s left to be done is to write the code. You split it into steps that you think will bring you success, and start coding. Quickly, you realize it’s not so simple, and a few days later you get a version running but you’re not happy with it. It’s fragile, big, and performing awfully. It has the elegance of a drinking giraffe, and you feel disappointed. Many programmers say to themselves at this point: ‘How hard can it be?’.
Even worse, their managers, who can’t code to save their lives, will ask the same question.
Let me share an insight that comforts me at times like these. I will try to answer that exact question.
A Lower Bound
I will cheat a bit, as trying to establish exactly how difficult programming is is too difficult for me. So I will settle for a lower bound. Let’s find something that’s actually easier than programming. Well, chess seems to be a good candidate.
Chess is a limited game. Its entire universe consists of only 64 squares. There are two armies of maximum 16 pieces, each piece following some simple rules. Players take turns. Each player moves a piece (exceptionally 2) and tries to mate his opponent. In the starting position, one has only 20 possible moves, and a complex position has some 50 possible moves (I have seen composed positions with more than 200 moves) of which maybe 10 seem actually playable. A game can last 200 moves, but a typical game hardly reaches 50 moves. In the end, you can only have a finite amount of possible positions.
Programming on the other hand is far less limited. You can combine algorithms, data structures, programming languages and paradigms in almost an infinite number of combinations. As you write code, the only thing holding you back is that you have to produce something that’s acceptable to the syntax of the programming language you’re using. Other than that you’re free. A programming language like Java has a simple (compared to C++ at least) grammar, but still is very rich in possibilities. You can take the grammar to build small classes. You can then take these classes, and combine them like Lego, to build more complex building blocks, ad infinitum. Event better, your playground (the hardware you work against) becomes more and more powerful so you can try things that were impossible just a few years ago. Compared it to chess, it seems that in programming you can combine pieces and build new ones, enlarge the board, and so on. Come to think of it, it’s safe to say programming is as least as complex as chess.
If someone would still have doubts, I can say that we build chess playing programs that will beat any human. How far are we on programming programs? Ok, Chess is indeed a lower bound. Programming is at least as hard as Chess. It might be a lot harder, but that is too hard to prove, so I’ll settle for ‘at least as hard’.
About Chess
Let’s talk about Chess some more. It is actually quite a mature game. The first professional supposedly was Abu-Bakr Muhammad ben Yahya as-Suli (854-946). He was the strongest player of his time, and author of a book describing a systematic way of playing Shatranj. Apparently, in the Arab World, one can still occasionally hear the complement “he played like Yahya as-Suli”. But he played one of chess’s predecessors.
Modern chess has been around for more than half a millenium, and people are quite surprised to learn that some of the standard manoeuvres have been around that long.
For example, every chess player knows the Lucena position shown below

which is attributed to Luis Ramirez de Lucena, who wrote a chess book published in 1497 (you read correctly: 1497!).
There are a great many chess books around. Google books answers intitle:chess with more than 200000 results. Only a minority of these books concern beginners as most of these target club players and above. It’s also funny to note that such a limited game is rich enough for lot of new books to be published every year.
About Chess Players
One of the interesting things about chess is that every player has an elo rating. This allows you to see exactly how good a player is, and it allows you to calculate the odds of a player winning a game against another player. Fide keeps records of the ratings, and has an online database you can browse. For example, let’s take two of the best players in the world, and compare them.Top dog Magnus Carlsen currently has a rating of 2835. If he’d play the number one US player Hikaru Nakamura, who currently has a rating of 2759, he will score about 0.60, or score about 6 points in a 10 game match.
import math def expected(r_a,r_b): d = r_b - r_a e = d / 400.0 n = 1 + math.pow(10,e) return 1.0/n print expected(2835, 2759)A person's chess rating evolves in time. If one scores better than one's rating predicts, the rating will go up, and vice versa. Improving your chess skill equals an elo gain. For example, Anish Giri's evolution can be found here. In fact, elo is so prevalent, that people use it to refer to an opponent: it is even more important than the opponents name. If a chess player wants to show his game to a collegue, he will say something like:"I've beaten a 1900 yesterday. Let me show you how it went".
Wanna Play?
As it turns out, chess is too hard a problem for most people to master. A master (see International Master) has a rating somewhere above 2400. It is a level only a few people ever achieve (about 1 or 2 percent of chess enthusiasts world wide).
Personally, I can testify it is very difficult: I started playing at the age of 17, invested countless hours, read (at least bought) more than one hundred books, and I only reached a level of 2000. There are some defects in my game I just cannot seem to eliminate, so I must limit my ambition. Talking to people about the reasons for this, they say the main reason is that I started too late. Anyway, if I put in some more effort, maybe I can reach 2100. Maybe.
So You Think You Can Code?
Ok, let's go back to programming. We asserted before that it is at least as difficult as chess, and then learned chess is actually very difficult. So programming is really really hard. A way to improve is by doing it, analysing where you went wrong, and iterate, hoping you improve while doing this.
Just like chess, reading a book about it will not make you a better programmer. If you're lucky, and read a good book on the topic (be it chess or programming), you will get some insights in what you are doing wrong. But these insights themselves will not magically improve you. So you need to put in hard effort. Some of the things I learned that will help you do that effort are:
- Tackle a new type of problem (a chess engine, a rendering engine, a distributed key value store, a compiler, ...)
- Learn a new programming language
- Learn a new programming paradigm(FP,OO,Declarative)
- Learn a new platform(GPU,FPGA,...)
- Learn from a better programmer
Master Coders, Anyone?
Whatever the activity, some humans will be better than others. As with chess, in programming, there will exist some people that have reached 'mastery'. But, I'm a sceptic. I haven't seen any. Moreover, as long as we don't have anything in place like programmer elo I can claim they haven't been observed. Actually, programmer elo can function as a discriminator for authors too, just like it does for chess books (chess book written by a non-master are rare, and being a master is not a guarantee you'll write a good book either, but that's an entirely different discussion)
I'm quite annoyed by the books that come out telling me how to code or to organize the software development process, from people who are clueless. Long gone are the days I used to attach value to a thing merely because it was written down in a book. In fact, the more somebody tells me how to do it, the more suspicious I become. This is a direct result from my interaction with chess masters. Sometimes I ask advice from a chess master on a problem I faced, and most of the time they answer in a sort of fuzzy wordings like "I think you probably should try to arrange your pieces something like this ...". The same position, when shown to a lower rated player often leads to an explanation like this: "you need to first take here and then there, and push that pawn". How can it be that the same position is unclear to a master, while it is clear to a patzer ? I think it comes from respect for the complexity of game. If you see more, you fear more.
Closing Words
Basically, I think two things are worth remembering from this post. First, programming is really really hard. Second, it would be beneficial if there would be a programmer elo. It would at least prevent some authors of gathering enough courage to publish a book on programming. It would certainly allow better hiring decisions, and maybe could help you earn the money you deserve for your efforts. Or are you overpaid?
have fun,
Romain.
Post Scriptum
Some people remarked something like programmer elo couldn't be done because programming isn't a head-to-head competition. As it turns out, this is not necessary. Sites like chesstempo.com show that you can also compete against puzzles. What happens there is that a competitor gets a puzzle. If he solves it, he collects points from the puzzle and the puzzle loses rating points. If the user fails, the puzzle collects points from the user. This way not only the users get sorted, but also the puzzles, which allows the site to always present users puzzles that more or less fit their level. The same concept can be used for any kind of test/examination. The good thing is that people get a tool to calculate their level and can work to improve it. Exactly the kind of feedback loop one needs.
Rediscovering the RSync Algorithm
Posted: February 14, 2012 Filed under: algorithms, OCaml, Programming, Uncategorized | Tags: algorithm, OCaml, optimization, Programming 10 Comments »A:Ok, you’re synchronizing this over the web;
and what do you use for the synchronization?
B: Oh, we implemented the rsync algorithm.
A: uhu. And what do you do with really big files?
B: The same.
A: And you also synchronise folders?
B: Yes.
A: And how do you do that?
B: we iterate over the folder, using the algorithm on every file, recursing over subfolders.
A: Can you try 2 things for me? First, a very large file; and second, a large codebase, and see if it holds.
Introduction
First of all, I am an admirer of the (original) rsync algorithm. (I think) it was a first stab at file synchronization, and it was so good people still implement it today.
But if you don’t understand its strenghts and weaknesses you might be in for a disappointment.
The Problem
You have 2 files, A’ and A that are different but alike. They reside on 2 different locations connected through a network. You want to make A’ identical to A.
The simplest solution is to just copy A, but given the similarity between the two files, there has to be a better way.
Historical Interlude
Networks used to be notoriously bad in the early 90s. Everybody who was transferring archives over large distances instinctively knew about a critical download size.
If the archive was too large, it would take too long, yielding a 100% chance something would go wrong somewhere resulting in an interrupted download. Even if the (up- or) download succeeded, chances were a small part of the file got corrupted, and you had to start over. The two first alleviations to this problem were checksums to detect accidental corruptions, and resumptions (being able to start a download at a certain offset).
RSync took care of interrupted downloads, and also provided a better solution when your file was corrupt. On top of that, it allowed low cost propagation of small changes, opening up a whole new range of applications. System administrators had a shiny new hammer.
The RSync Strategy
RSync just does a single round trip. First it creates a signature of A’, sends it over. On the other location it scans the local file, tries to find parts that are in the signature, while constructing a recipe as a stream of instructions. It’s possible to derive the algorithm starting from a primitive version, improving it step by step.
Since it’s fun too, I’ll be doing that here. While we’re playing, we’ll abstract away from IO, because it clouds the algorithmical view.
Mark 0
Let’s attack this in pure caveman style. Making a signature is splitting the file in blocks of equal size (except maybe the last). Iterating over the blocks, you calculate a digest and accumulate digests and block identifiers. Block identifiers are just their number: the first block has id 0, the second block id 1 aso.
let file_signature f b_size =
let f_len = String.length f in
let rec loop acc s i =
if s = f_len
then acc
else
let b_len = min b_size (f_len - s) in
let b = String.sub f s b_len in
let b_sig = block_signature b in
let acc' = (b_sig,i) :: acc in
loop acc' (s+b_len) (i+1)
in
loop [] 0 0
We have lots of choice to calculate a block signature. Let’s be lazy and pick Digest.string which is the md5 checksum of the block.
let block_signature block = Digest.string block
To recreate the file you need to interprete the stream of instructions. But what would these instructions be?
Well, in this version, you can be told to copy over a block or write a char.
type instruction = | C of char | B of int
Ok, how do you combine the signature together with the new file to generate a stream of instructions?
First thing that comes to mind is to scan over the new file, starting at position s
- consider the block starting at s and try to find it in the signature.
- if you find it, add a B j instruction, and jump a block forward.
- if you miss it, add a C c instruction, and step forward 1 position.
Let’s do that:
let updates f0_sig b_size f1 =
let f1_len = String.length f1 in
let rec loop acc s =
if s = f1_len
then List.rev acc
else
let b_len = min b_size (f1_len - s) in
let block = String.sub f1 s b_len in
let u,step =
try
let d = block_signature block in
let i = List.assoc d f0_sig in
(B i), b_len
with Not_found -> (C block.[0]), 1
in
let acc' = u :: acc in
loop acc' (s + step)
in
loop [] 0
The last step in our synchronisation scheme is to create a file using the old file,
together with the stream of instructions.
let apply old b_size ins =
let old_len = String.length old in
let buf = Buffer.create old_len in
let add_block i =
let s = i * b_size in
let b_len = min b_size (old_len - s) in
let block = String.sub old s b_len in
Buffer.add_string buf block
in
let add_char c = Buffer.add_char buf c in
let () = List.iter (function
| B i -> add_block i
| C c -> add_char c)
ins
in
Buffer.contents buf
So it took 60 lines of code to have a first stab at a synchronisation algorithm.
Let’s try this on an example:
let bs = 5
let remote = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
let mine = "aaaaabXbbbcccccddddde012"
let mine_s = file_signature mine bs
let delta = updates mine_s bs remote
let mine2 = apply mine bs delta;;
val bs : int = 5
val remote : string = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
val mine : string = "aaaaabXbbbcccccddddde012"
val mine_s : (Digest.t * int) list =
[("$\240\t\221\19200\222\199\2035\190|\222~#\n", 4);
("P\248M\175:m\253j\159 \201\248\239B\137B", 3);
("g\199b'k\206\208\158\228\22314\2137\209d\234", 2);
("\196\148\"\21926Lm\179V E=\201O\183,", 1);
("YO\128;8\nA9n\214=\2029P5B", 0)]
val delta : instruction list =
[B 0; C 'b'; C 'b'; C 'b'; C 'b'; C 'b'; B 2; B 3; C 'e'; C 'e'; C 'e';
C 'e'; C 'e'; C 'f'; C 'f'; C 'f'; C 'f'; C 'f'; C 'g'; C 'g'; C 'g';
C 'g'; C 'g'; C 'h'; C 'h'; C 'h'; C 'h'; C 'h'; C 'i'; C 'i'; C 'i';
C 'i'; C 'i'; C 'j'; C 'j'; C 'j'; C 'j'; C 'j'; C 'k'; C 'k'; C 'k']
val mine2 : string = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
Ok, it works, but how good is it?
Before I can answer this, first a note on block size. There are some ‘forces’ to be reckoned with
- the blocksize needs to be big compared to the block signature.
- if the blocksize is big, you will find less matches between the signature and the new file, so you need send more data back
- if the blocksize is small, you have lots of them, meaning your signature will be bigger
and you need an efficient lookup
The best blocksize will be depend not only on the file size, but on the actual changes.
How important is it really?
Well, let’s sync 2 images:
![]() |
![]() |
These 2 images are bitmaps of 5.5 MB each (stored as .bmp).
(I actually uploaded smaller versions as it seems useless to let your browser download more than 10MB for just 2 silly image samples)
I’ll sync’em using different blocksizes and see what gives.
let main () =
let bs = int_of_string (Sys.argv.(1)) in
let () = Printf.printf "bs=%i\n" bs in
let remote = get_data "new_image.bmp" in
let () = Printf.printf "remote: size=%i%!\n" (String.length remote) in
let mine = get_data "old_image.bmp" in
let mine_s = X.file_signature mine bs in
let () = Printf.printf "mine_s: len=%i%!\n" (Array.length mine_s) in
let delta = X.updates mine_s bs remote in
let (nbs,ncs) = List.fold_left (fun(bs,cs) i ->
match i with
| X.B _ -> (bs+1,cs)
| X.C _ -> (bs,cs+1)
) (0,0) delta
in
let mine2 = X.apply mine bs delta in
let () = Printf.printf "mine2: size=%i%!\n" (String.length mine2) in
let () = Printf.printf "bs=%i;cs=%i\n" nbs ncs in
| block size | # block signatures | blocks | chars | time |
| 8192 | 704 | 535 | 1384448 | 71s |
| 4096 | 1407 | 1084 | 1323008 | 92s |
| 2048 | 2813 | 2344 | 960512 | 92s |
| 1024 | 5626 | 4995 | 646144 | 115s |
| 512 | 11251 | 10309 | 482304 | 172s |
| 256 | 22501 | 20972 | 391424 | 283s |
| 128 | 45001 | 42508 | 319104 | 537s |
The first column is the block size. The second is the number of blocks in the file, the third is the number of B instructions and the fourth is the number of C instructions.
The last columns is measured execution time on my laptop. Processing time is the biggest issue. Ocaml is fast, but not fast enough to compensate for my brutal inefficiency. Imagine what it would do to a 4GB movie.
Mark 1
The problem is quickly revealed: List.assoc is not suited for long lists.
A better solution is use an array, sort it on block signature, and use binary search
(using a hashtable would be viable too).
let block_signature block = Digest.string block
let file_signature f b_size =
let f_len = String.length f in
let s = ref 0 in
let n_blocks = (f_len + b_size -1) / b_size in
Array.init n_blocks
(fun i ->
let start = !s in
let b_len = if start + b_size > f_len then f_len - start else b_size in
let b = String.sub f start b_len in
let b_sig = block_signature b in
let () = s := start + b_len in
(b_sig,i)
)
type instruction =
| C of char
| B of int
let updates f0_sig b_size f1 =
let my_cmp (s0,i0) (s1,i1) = String.compare s0 s1 in
let () = Array.sort my_cmp f0_sig in
let len = Array.length f0_sig in
let rec lookup b=
let rec find min max =
let mid = (min + max) / 2 in
let (ms,mi) = f0_sig.(mid) in
if ms = b
then mi
else if min > max then raise Not_found
else if b > ms then find (mid+1) max
else find min (mid -1)
in
find 0 (len -1)
in
let f1_len = String.length f1 in
let rec loop acc s =
if s = f1_len
then List.rev acc
else
let b_len = min b_size (f1_len - s) in
let block = String.sub f1 s b_len in
let u,step =
try
let d = block_signature block in
let i = lookup d in
(B i), b_len
with Not_found -> (C block.[0]), 1
in
let acc' = u :: acc in
loop acc' (s + step)
in
loop [] 0
let apply old b_size ins =
let old_len = String.length old in
let buf = Buffer.create old_len in
let add_block i =
let s = i * b_size in
let b_len = min b_size (old_len - s) in
let block = String.sub old s b_len in
Buffer.add_string buf block
in
let add_char c = Buffer.add_char buf c in
let () = List.iter (function
| B i -> add_block i
| C c -> add_char c)
ins
in
| block size | # block signatures | blocks | chars | time(s) |
| 8192 | 704 | 535 | 1384448 | 41 |
| 4096 | 1407 | 1084 | 1323008 | 20 |
| 2048 | 2813 | 2344 | 960512 | 7.8 |
| 1024 | 5626 | 4995 | 646144 | 1.9 |
| 512 | 11251 | 10309 | 482304 | 1.1 |
| 256 | 22501 | 20972 | 391424 | 0.8 |
| 128 | 45001 | 42508 | 319104 | 0.9 |
Wow, this is quite unexpected (but we’re not complaining). So what happened? Well, the lookup was so dominating, it was cloaking all other behaviour.
Now, with the lookup out of the way, other things can be observed. The problem now is that a ‘miss’ not only causes a C instruction to be emitted, but more importantly, it causes another digest to be calculated. The less blocks are found, the higher the processing time.
Mark 2
The cost of the miss was solved by Andrew Tridgell by introducing a second, weaker hash function.
He used the Adler-32 checksum which is a rolling checksum. ‘Rolling’ means that
adler32(buffer[a+1:b+1])= cheap(adler32(buffer[a:b]), which makes it suitable for our problem here. The ocaml standard library does not have an adler32 module, so I hacked one up.
It’s a naieve implementation in that it follows the definition closely. In fact, most of the modulo operations can be avoided by doing some extra administration.
I didn’t bother.
module Adler32 = struct
type t = {mutable a: int; mutable b: int; mutable c: int}
let padler = 65521
let make () = {a = 1 ;b = 0; c = 0}
let from buf offset length =
let too_far = offset + length in
let rec loop a b i=
if i = too_far
then {a; b; c = length}
else (* modulo can be lifted with a bit of math *)
let a' = (a + Char.code(buf.[i])) mod padler in
let b' = (b + a') mod padler in
loop a' b' (i+1)
in
loop 1 0 offset
let reset t = t.a <- 1;t.b <- 0; t.c <- 0
let digest t = (t.b lsl 16) lor t.a
let out t c1 =
let x1 = Char.code c1 in
t.a <- (t.a - x1) mod padler;
t.b <- (t.b - t.c * x1 -1) mod padler;
t.c <- t.c - 1
let rotate t c1 cn =
let up x = if x >= 0 then x else x + padler in
let x1 = Char.code c1 in
let xn = Char.code cn in
t.a <- up ((t.a - x1 + xn) mod padler);
t.b <- let f = (t.c * x1) mod padler in
let r = (t.b - f + t.a -1) mod padler in
up r
let update t buf offset length =
let too_far = offset + length in
let rec loop i =
if i = too_far then ()
else
let x = Char.code buf.[i] in
let () = t.a <- (t.a + x) mod padler in
let () = t.b <- (t.b + t.a) mod padler in
let () = t.c <- t.c + 1 in
loop (i +1)
in
loop offset
let string block =
let t = from block 0 (String.length block) in
digest t
end
Adler32 is much weaker than md5 and using it exposes you to collisions. So in fact, it acts as a cheap test that might yield false positives. That’s the reason the rsync algo keeps both: if the adler32 of the buffer is found in the signature, there is a second check to see if the md5s match. The fact one sometimes needs to rotate the checksum and at other times needs to reinitialize if from a part of the buffer, complicates the calculation of the updates a bit.
The code is shown below.
let updates f0_sig b_size f1 =
let my_cmp (wh0,sh0,i0) (wh1, sh1,i1) = compare wh0 wh1 in
let () = Array.sort my_cmp f0_sig in
let len = Array.length f0_sig in
let rec lookup w =
let rec find min max =
let mid = (min + max) / 2 in
let (mw, ms,mi) = f0_sig.(mid) in
if mw = w
then (ms,mi)
else if min > max then raise Not_found
else if w > mw then find (mid+1) max
else find min (mid -1)
in
find 0 (len -1)
in
let f1_len = String.length f1 in
let weak = Adler32.from f1 0 b_size in
let rec loop acc b e =
if b = f1_len
then List.rev acc
else
let wh = Adler32.digest weak in
let step_1 () =
let bc = f1.[b] in
let a = C bc in
let b' = b + 1 in
if e +1 < f1_len
then
let e' = e + 1 in
let ec = f1.[e] in
let () = Adler32.rotate weak bc ec in
loop (a:: acc) b' e'
else
let e' = e in
let () = Adler32.out weak bc in
loop (a:: acc) b' e'
in
try
let (s0,i0) = lookup wh in
let sh = Digest.substring f1 b (e - b) in
if s0 = sh
then
let b' = e in
let e' = min f1_len (e + b_size) in
let a = B i0 in
let () = Adler32.reset weak in
let () = Adler32.update weak f1 b' (e' - b') in
loop (a :: acc) b' e'
else step_1 ()
with Not_found -> step_1 ()
in
loop [] 0 b_size
The code indeed is a bit messier as we have more things to control at the same time, but it’s still quite small. Let’s see how wel it performs:
| block size | # block signatures | blocks | chars | time(s) |
| 8192 | 704 | 535 | 1384448 | 0.9 |
| 4096 | 1407 | 1084 | 1332008 | 0.9 |
| 2048 | 2813 | 2344 | 960512 | 0.8 |
| 1024 | 5626 | 4995 | 646114 | 0.6 |
| 512 | 11251 | 10309 | 482304 | 0.6 |
| 256 | 22501 | 20401 | 537600 | 0.7 |
| 128 | 45001 | 42508 | 319104 | 0.7 |
This almost completely removes the cost of a miss; at least for things of this size. The choice of blocksize does however affect the amount of data you need to send over.
There is a lot of other things you can do here:
- Block Size Heuristic: something like O(sqrt(file_size))
- SuperInstructions: make instructions for consecutive Blocks, and consecutive Chars
- Emit function: don’t accumulate the updates, but emit them (using a callback function)
- Smaller signatures: you can consider to drop some bytes from the strong hash
- IO
- Compress update stream
- …
The last remaining problem is that some modifications are not handled well by the algorithm (for example 1 byte changed in each block).
Maybe you could try a better algorithm.
There are lot’s of them out there, and they are worth checking out. (Before you know it you’ll be dabling into merkle trees or set reconciliation )
Anyway, I already exceeded my quotum for this post, but I still want to say a little thing about synchronisation of folders
The Next Problem
You have 2 trees of files, A’ and A that are different but alike. They reside on 2 different locations connected through a network. You want to make A’ identical to A.
What Not To Do
Don’t walk the folder and ‘rsync’ each file you encounter.
A small calculation will show you how bad it really is.
Suppose you have 20000 files, each 1KB. Suppose 1 rsync costs you about 0.1s
(reading the file, sending over the signature, building the stream of updates, applying them).
This costs you about 2000s or more than half an hour. System administrators know better:
they would not hesitate: “tar the tree, sync the tars, and untar the synced tar”.
Suppose each of the actions takes 5s (overestimating) you’re still synced in 15s.
Or maybe you can try unison. It’s ocaml too, you know.
have fun,
Romain.
Cloud Search Using Suffix Arrays ? Well, … maybe.
Posted: February 3, 2012 Filed under: algorithms, Python, Uncategorized | Tags: cloud, python, suffix arrays, text search 2 Comments »Suffix Arrays are arrays that allow you to find exact substring matches. The core idea is that you generate a sorted array of positions using a comparison function that compares the suffixes starting at the respective positions.
Constructing a suffix array
It’s one of the cases where a few lines of code is more clarifying than the explanation itself.
def make_suffix_array(text):
size = len(text)
def compare_suffix(i,j):
return cmp(text[i:],text[j:])
indexes = range(size)
indexes.sort(cmp = compare_suffix)
return indexes
If you try this on the string “Suffix Arrays Rock?” and print out the sorted array’s indexes along with the suffixes that start there, you start to see both potential, as well as weaknesses.
| pos | suffix |
| 06 | ‘ Arrays Rock?’ |
| 13 | ‘ Rock?’ |
| 18 | ‘?’ |
| 07 | ‘Arrays Rock?’ |
| 14 | ‘Rock?’ |
| 00 | ‘Suffix Arrays Rock?’ |
| 10 | ‘ays Rock?’ |
| 16 | ‘ck?’ |
| 02 | ‘ffix Arrays Rock?’ |
| 03 | ‘fix Arrays Rock?’ |
| 04 | ‘ix Arrays Rock?’ |
| 17 | ‘k?’ |
| 15 | ‘ock?’ |
| 09 | ‘rays Rock?’ |
| 08 | ‘rrays Rock?’ |
| 12 | ‘s Rock?’ |
| 01 | ‘uffix Arrays Rock?’ |
| 05 | ‘x Arrays Rock?’ |
| 11 | ‘ys Rock?’ |
Note: This method of creating suffix arrays is criminally inefficient, but it suffices for explaining the concept. There are algorithms to do this in linear time.
You can use the suffix array to search for a string in the text using binary search. Something like this:
def find(sa, text, q):
size = len(sa)
qsize = len(q)
hi = size -1
lo = 0
while hi >= lo:
mid = (hi + lo) / 2
begin = sa[mid]
end = min(size, begin + qsize)
test = text[begin: end]
if test > q:
hi = mid -1
elif test < q:
lo = mid + 1
else:
return begin
return None
You have a fast search that allows you to return the exact position of the substring.
Even better: all matches are clustered together in the suffix array. Perfect locality.
Multiple documents
The concept easily extendeds itself to multiple documents.
def make_multi(texts):
indexes = []
nt = len(texts)
for d in xrange(nt):
size = len(texts[d])
for i in xrange(size):
e = d,i
indexes.append(e)
def compare(e0,e1):
d0, i0 = e0
d1, i1 = e1
s0 = texts[d0][i0:]
s1 = texts[d1][i1:]
return cmp(s0,s1)
indexes.sort(cmp = compare)
return indexes
A minimal example with 3 very small texts, and a bit of code to dump the suffix array, together with the suffixes.
def print_multi(sam, texts):
for e in sam:
d, i = e
suffix = texts[d][i:]
print "(%2i,%2i)\t%s" % (d,i,suffix)
texts = ["Suffix Arrays Rock?",
"Redundant Array of Inexpensive Disks",
"Metal is not Rock!"]
r0 = make_multi(texts)
print_multi(r0, texts)
yields:
( 1, 9) Array of Inexpensive Disks ( 0, 6) Arrays Rock? ( 1,30) Disks ( 1,18) Inexpensive Disks ( 2,12) Rock! ( 0,13) Rock? ( 2, 5) is not Rock! ( 2, 8) not Rock! ( 1,15) of Inexpensive Disks ... ( 1,28) ve Disks ( 0, 5) x Arrays Rock? ( 1,22) xpensive Disks ( 1,14) y of Inexpensive Disks ( 0,11) ys Rock?
Ok, this seems to have almost perfect scalability properties. What’s the catch?
The catch
There are some severe downsides to Suffix Arrays.
- Text needed during search
A major disadvantage to suffix arrays is that you need access to the text while you do a search. It means that the suffix array and the text need to be kept close to each other. - Storage Needs
If you think of it, a text is an array of chars (ascii, not unicode). Its suffix array is an array of positions. Suppose you store a position and document ids as a 32bit words. Remember you need to store the text, as well as the suffix array. If the text size is n, the total needs are 9n. To make matters worse: since you’ll be jumping around in the suffix array, as well as the text, compression schemes are not really feasible. - Updates?
Even the slightest update to the text will invalidate the suffix array, and you need to reconstruct it. Even worse, if you have an array over multiple documents, an update in one of the documents will invalidate the entire structure. - Exact Matches
You get what you asked for. Searching the example text for ‘suffix’ will not yield a result. - Overkill
You can search for an exact match on a string of arbitrary length, but nobody needs to find exact matches on large strings. There is one notable exception: If you’re trying to look up DNA fragments in a large DNA string, this is exactly what you need. For everybody else, it’s overkill.
Getting better all the time
Well, it couldn’t get much worse now could it?
It doesn’t need to be this bad, and you can actually pick how much pain you want to inflict on yourself.
- search starts at start of words
Most of the time, it’s not needed to be able to search for substrings of words. For example, you probably don’t need to find the ‘rays’ in ‘arrays’. It divides the number of positions to sort by the average word size. - remove language noise
- old versions can be blacklisted
If all versions of all your documents have a unique id, you can blacklist older unwanted document ids, and filter them out before you return the results. This means you don’t have to do updates immediately, but you can schedule them when it’s cheapest for you. - text size is not document size
Suppose your documents are pdfs. The actual size of the text is small compared to all the other information in the file (markup, images, fonts, …). Moreover, storing text will not eat your storage. For example, a heavyweight (in every way) novel like War And Peace takes about 3.2 MB. The entire corpus of english wikipedia articles, which is stored as xml is just over 30 GB. peanuts. - multiple suffix arrays
You don’t need add all your documents to one big suffix array. You can have many of them wich improves the situation in more than one way. Search can be parallel, and adding a new document is cheaper as well.
Not all words need to be considered. In fact, the most frequent words are noise.
What does this have to do with ‘The Cloud’ ?
Since there is a network between the client issuing the request and the cloud infrastructure offering the search service, means you have a non-neglectable latency.
This means the latency added by the service just needs to be small compared to the overall latency, which makes our life a bit easier.
Also, the nature of ‘The Cloud’ pushes people to store consolidated instead of live data, which is exactly what is best suited for suffix arrays.
Another prerequisite for suffix arrays is the availability of the text during search, which is exactly what happens in a cloud setup.
Closing words
Suffix arrays are an idea that was conceived back when ‘online’ still had other semantics. At the time (1991) they were seen as too expensive for general search, but that was back when ‘big hard disks’ had less than 100MB of capacity. Maybe Suffix Arrays deserve a second look.
I can feel the urge to whip up an email search/indexing service in a few hundred lines of code, or a private wikipedia search service.
Just one question remains: what shall it be? ocaml or haskell? hm, tough one.
have fun,
Romain.
Number of trailing zeroes
Posted: January 24, 2012 Filed under: algorithms, Programming | Tags: bitset, De Bruijn, graph theory 3 Comments »Last time, I hinted about using De Bruijn sequences for speeding up iteration of bit sets. It’s an old trick that (I think) was discovered by chess programmers in the 1960s (I really need a reference for this).
It’s a classic that yields elegant and quintessential code, but really needs a sidebar explanation of what’s going on.
The problem
For a binary number, write a function that determines the number of trailing zeroes, ie how many times you can divide this number by 2 without remainder.
The trivial implementation is this:
let rec ntz n =
let rec loop acc n =
if n land 1 = 1
then acc
else loop (acc+1) (n lsr 1)
loop 0 n
It’s surprising to learn it’s possible to find the answer without looping at all.
Least significant one
If you look at the problem, you quickly realize there is only one 1 that matters. The least significant one.
It can be isolated like this:
lso = n land ((lnot n) + 1)
or (shorter)
lso = n land (-n)
Let’s look at a few examples for byte size numbers
| n | bin(n) | bin (lnot n) | bin (-n) | lso |
| 2 | 0000 0010 | 1111 1101 | 1111 1110 | 0000 0010 |
| 7 | 0000 0111 | 1111 1000 | 1111 1001 | 0000 0001 |
| 40 | 0010 1000 | 1101 0111 | 1101 1000 | 0000 1000 |
| 48 | 0011 0000 | 1100 1111 | 1101 0000 | 0001 0000 |
Now comes a bit of a detour into graph theory.
De Bruijn Graphs
A detailed introduction is found here, but the important thing to remember is how you construct this graph. The example below is a (2,3) graph. You start with a vertex 000. Then, starting at a vertex, you slide in (shift left and append) either a 1, or a 0, and draw an edge to the result node. Iterate and stop when you have them all.
So after the first step, you have this:
after the next step, you have:
and the closed graph is:
De Bruijn sequences
A De Bruijn sequence is just a shorthand for an Hamiltionian path (a path that visits every vertex once) through a De Bruijn graph. An example of an Hamiltonian path for the full graph above is:
000 -> 001 -> 010 -> 101 -> 011 -> 111 -> 110 -> 100 -> 000
A compact representation for this path can be recorded by writing down the start point and the choices made at every point. The example path yields following sequence:
00010111
When sliding through the sequence, one bit at a time with a window of 3 bits, each of the 8 possible 3 digit sequences appears exactly once.
| n | index |
| 000 | 0 |
| 001 | 1 |
| 010 | 2 |
| 011 | 4 |
| 100 | 7 |
| 101 | 3 |
| 110 | 6 |
| 111 | 5 |
The trick
The table above shows you at each node how many steps you’re from the source of the graph.
So if you multiply the sequence with the least significant one or our number, we can use the table to find out how many shifts we had.
for example: 00010111 * 0010000 = 00101110000. You keep the right most 8 bits 0111000.
By looking at the leftmost 3 (= 8-5) bits 011, which you find by shifting 5 positions to the right,
you see that they are at index 4 in the table. So the number has 4 trailing zeroes.
Summary
To find the number of trailing zeroes for an 8 bit number, you need
- an 8-bit De Bruijn sequence: debruijn = 00010111
- a table of 8 precomputed entries 0,1,2,4,7,3,6,5
- the following formula:
lso = n & (-n) ntz = table[(lso * debruijn) >> 5]
64 bit version
If you want to do this for 64 bits, you follow the exact same reasoning. You need a 64-bit debruijn sequence (there are 2^26 of them), and a precomputed table of size 64.
#define DEBRUIJN 0x22fdd63cc95386dULL
static uint32_t table [64] ;
void inittable (){
uint64_t db = DEBRUIJN ;
int i ;
for (i=0 ; i < 64;++ i ) {
table [db >> 58] = i ;
db = db <<1;
}
}
uint32_t ntz ( uint64_t n){
n &= −n ;
n ∗= DEBRUIJN ;
n >>= 58;
return table [n] ;
}
Use in bitsets
Suppose you’re representing a bitset as an array of 64 bit words. When iterating, you slide through the words, and if a bit is 1 you do your thing. If the rest of the word you’re sliding through only contains zeroes, you’re wasting effort. Knowing the number of trailing zeroes allows you to calculate the last index for the loop over the bits of a word.
A look at a random standard library
Java
There is a java.util.BitSet class that has been in the java standard library since 1.0. Since the source is open, we can inspect it. Let’s take a peek.
The iterator uses
Long.numberOfTrailingZeros(word)
and the implementation there is:
public static int numberOfTrailingZeros(long i) {
// HD, Figure 5-14
int x, y;
if (i == 0) return 64;
int n = 63;
y = (int)i; if (y != 0) { n = n -32; x = y; } else x = (int)(i>>>32);
y = x <<16; if (y != 0) { n = n -16; x = y; }
y = x << 8; if (y != 0) { n = n - 8; x = y; }
y = x << 4; if (y != 0) { n = n - 4; x = y; }
y = x << 2; if (y != 0) { n = n - 2; x = y; }
return n - ((x << 1) >>> 31);
}
(you can read the full source here: Long.java )
The “HD” reference in the source code points to Henry S. Warren, Jr.’s Hacker’s Delight, (Addison Wesley, 2002).
I will not comment on this.
Other languages
Left as an exercise for the reader
have fun,
Romain.
Share your mistakes: adventures in optimization
Posted: January 17, 2012 Filed under: algorithms, OCaml, Programming | Tags: bitset, josephus, OCaml, optimization, sampling methods, sparse set 6 Comments »I used to think I knew the laws of code optimization. In my (not so) humble opinion they were
- profile before you optimize
- after profiling tells you what the problem is, first try a better strategy (algorithm or data structure)
- tweak code as a last resort
It’s a pure economical reasoning that’s behind this: if your code is not fast enough, first find the biggest culprit and eliminate it. By taking out the biggest you get the most value for money, and using something that yields orders of magnitude, gain the most. Tweaking code or moving to a more low-level programming language can only give you a factor of improvement, so if you have the choice, use the bigger gun.
Suppose, as an example, profiling reveals your cost can be written like this:
Cost = 0.8 * BiggestCulprit + 0.2 * EverythingElse
You know what to do: kill the BiggestCulprit. Btw, Pareto tells you it’s commonly something like that (80-20). Ok, using a big hammer you replaced the BiggestCulprit with something that’s 100 times cheaper.
Cost2 = 0.8 * 0.01 * BiggestCulprit + 0.2 * EverythingElse = 0.208 * Cost
If you need to go even faster, you should try to optimize EverythingElse. Can you do this? Depends. Maybe you can split EverythingElse in
EverythingElse = 0.8 * TheNextHurdle + 0.2 * NotInteresting
If you can’t. It ends here.
The strategy is rational, but sometimes profiling points you in the wrong direction.
An example of a mistake I made
What follows below is an account of what happened to a piece of code over a period of two years. I hope you will, when reading on conclude that the mistakes were obvious, but at the time, they weren’t. Hindsight is 20/20.
The problem
As a small step in solving a bigger problem, we needed to generate a sample of size n from a set of size p. Important detail: no value can be selected more than once.
The population size (p) is roughly somewhere between 4000 and 16000, while the sample size is often very small, sometimes more than 2000, but never more than 2500 (we know its distribution).
Let’s look at the problem in isolation. The code shown below is a micro benchmark that is representative for our case, and we’re interested in minimizing the total running time by improving the implementation of the Sampler module
let micro_bench ns k p =
let test_one n =
let sample = Sampler.create n p in
let use_choice _ = () in
let rec loop k =
if k = 0
then ()
else
begin
if k mod 1000 = 0 then Printf.eprintf ".%!";
let () = Sampler.fill_sample sample n p in
let () = Sampler.iter sample use_choice in
let () = Sampler.clear_sample sample in
loop (k-1)
end
in
let t0 = Unix.gettimeofday() in
let () = loop k in
let t1 = Unix.gettimeofday() in
let d = t1 -. t0 in
Printf.printf "%i | %f \n" n d
in
List.iter test_one ns;;
let main () =
let k = 100 * 1000 in
let p = 10000 in
micro_bench [1;2;3;4;5;6;7;8;9;10;20;40;80;160;320;640;1000;2000;2500] k p;;
let () = main ();;
Our solution must adhere to the following signature:
module type S = sig type t val create : int -> int -> t val fill_sample: t -> int -> int -> unit val clear_sample: t -> unit val iter: t -> (int -> unit) -> unit end
The first solution, the one I coded in search of correctness and simplicity, was exactly that, simple and correct:
module S0 = (struct
type t = {mutable c: int; es:int array}
let create n p = {c = 0; es = Array.make n 0}
let in_sample t x =
let rec loop i =
if i < 0 then false
else
if t.es.(i) = x
then true
else loop (i-1)
in
loop (t.c -1)
let add2_sample t x =
let i = t.c in
t.es.(i) <- x;
t.c <- i+1
let fill_sample sample n p =
let rec loop i =
if i = 0
then ()
else
let rec find_new () =
let x = random_range p in
if in_sample sample x
then find_new()
else add2_sample sample x
in
let () = find_new () in
loop (i-1)
in
loop n
let clear_sample t = t.c <- 0
let iter t f =
let rec loop i =
if i = t.c
then ()
else
let () = f t.es.(i) in
loop (i+1)
in
loop 0
end : S)
The sample is accumulated in an array, and we test each candidate to see if we have it already. If so, we try again. Clearing the sample is putting the counter to zero, and iteration is merely iterating over the used part of the array. Simple enough, and it suffised for almost 6 months. A run of the microbenchmark (it takes some 1700 seconds) reveals what’s going on:
1 | 0.017802 2 | 0.017753 3 | 0.025648 4 | 0.033298 5 | 0.040910 6 | 0.050635 7 | 0.056496 8 | 0.065127 9 | 0.073126 10 | 0.081244 20 | 0.170436 40 | 0.402476 80 | 1.060872 160 | 3.131289 320 | 10.381503 640 | 36.543450 1000 | 85.969717 2000 | 408.716565 2500 | 1127.268196
The first column is sample size, the second is time needed for 100000 samples. As you can see, it’s really fast for small sample sizes, but quickly succumbs. Profiling shows it’s the in_sample function that’s to blame. It must scan the entire sample array so far. It gets even worse as the chance of picking an element that was chosen before increases.
Well, it isn’t that difficult to have a better membership test. The population size isn’t that big, so we can afford a BitSet. Adding a member in O(1), membership testing in O(1)… let’s do it, it should fly.
module S1 = (struct
type t = bool array
let create n p = Array.make p false
let in_sample t x = t.(x)
let add2_sample t x = t.(x) <- true
let clear_sample t =
let rec loop i =
if i < 0 then ()
else
let () = t.(i) <- false in
loop (i-1)
in
loop (Array.length t -1)
let fill_sample sample n p =
let rec loop i =
if i = 0
then ()
else
let rec find_new () =
let x = random_range p in
if in_sample sample x
then find_new()
else add2_sample sample x
in
let () = find_new () in
loop (i-1)
in
loop n
let iter t f =
let s = Array.length t in
let rec loop i =
if i = s then ()
else
let () = if t.(i) then f i in
loop (i+1)
in
loop 0
end : S)
Let’s see what this does.
1 | 3.760345 2 | 3.716187 3 | 3.730672 4 | 3.795472 5 | 3.799934 6 | 3.961258 7 | 3.804574 8 | 3.775391 9 | 3.807858 10 | 3.914987 20 | 3.949764 40 | 4.159262 80 | 4.430131 160 | 4.953897 320 | 6.132642 640 | 8.438193 1000 | 11.140795 2000 | 19.150232 2500 | 23.508719
It takes some 124 seconds to run it. Overall, it’s more than 10 times faster, but the small samples are a lot slower, so what happened?
A closer look (with the profiler) revealed 2 things:
- Clearing the bitset is O(p)
- Iterating the bitset also is O(p)
So we tried to remedy this by using a better representation of a bitset. An array of 64 bit words. Clearing is a lot faster there.
Iteration will be faster too as the bitset is expected to be sparse, and one can skip forward by inspecting the numberOfTrailingZeroes.
We optimized the clearing of the bitset, and dabbled into De Bruijn sequences for super fast iteration.
It’s a bit of topic, and maybe interesting enough for another post. The reason why I’m not digressing here is that it was the wrong road to go down to.
In the end, after a long detour, we settled on an entirely different strategy: Sparse Sets.
module S2 = (struct
type t = { mutable n: int;
mutable dense: int array;
mutable sparse: int array;}
let create n p =
{ n = 0;
dense = Array.make p 0;
sparse = Array.make p 0;
}
let add2_sample t x =
let n = t.n in
t.dense.(n) <- x;
t.sparse.(x) <- n;
t.n <- (n+1)
let in_sample t x =
let rsi = t.sparse.(x) in
let ok = rsi < t.n in
ok && (t.dense.(rsi) = x)
let iter t f =
let n = t.n in
let rec loop i =
if i = n then ()
else
let x = t.dense.(i) in
let () = f x in
loop (i+1)
in
loop 0
let clear_sample t = t.n <- 0
let fill_sample sample n p =
let rec loop i =
if i = 0
then ()
else
let rec find_new () =
let x = R.random_range p in
if in_sample sample x
then find_new()
else add2_sample sample x
in
let () = find_new () in
loop (i-1)
in
loop n
end: S)
Let’s see what this does:
1 | 0.019265 2 | 0.021046 3 | 0.030151 4 | 0.034281 5 | 0.040782 6 | 0.048158 7 | 0.055332 8 | 0.061747 9 | 0.068712 10 | 0.075462 20 | 0.144088 40 | 0.276297 80 | 0.539943 160 | 1.069994 320 | 2.143328 640 | 4.334955 1000 | 6.893774 2000 | 14.607145 2500 | 18.819379
It runs under a minute, and has the desired order of magnitude for our operations (adding, testing, clearing, iterating).
Meanwhile, if I ever need to revisit this, I have some aces up my sleeve:
- There is an 1984 paper “Faster Random Sampling methods(Jeffrey Scott Vitter)”
- I can always special case: if sample size below a carefully chosen threshold use S0 else, use something better suited for larger samples.
This will give me best of both worlds at the cost of ugliness.
My mistake
Have you figured out what I did wrong strategically? In the above example, I made it several times: I allowed profiling to set the scope of my optimization efforts. Profiling is great to discover bottlenecks and the possible gains of removing them, but it will give you a sort of narrowmindedness that limits the degree of success. Once you discovered a bottleneck, you need to step back, and also look at the context. The bigger the chunk you’ll be optimizing the higher the possible gains. In my concrete case, I should have been looking for a better sampling method.
Instead, I searched for a better set representation. The problem is that you tend to find what you’re looking for.
Armed with the new insight, I propose the following laws of optimization.
- profile before you optimize
- if you find a bottleneck, try to optimize the broad context CONTAINING the bottleneck.
- tweak code as a last resort
Confessions of a revisionist
I must confess that I moved the example out of its original context, which was a C codebase. I recreated the different versions we had of the C code in OCaml for your pleasure.
So yes, we made the common mistake of going to a lower level programming language too soon, naively thinking we had a good understanding of the problem we were trying to solve.
As a result, we wasted more time than we should have. Anyway, in the end I hope I compensated enough by writing freely about my mistakes, so you can avoid them.
have fun,
Romain.
PS
For those interested in the code itself. I have pushed the code to a git repo :
https://github.com/toolslive/Sampling
Baardskeerder’s transaction strategy
Posted: November 28, 2011 Filed under: Baardskeerder, OCaml, Programming, Research | Tags: baardskeerder, compaction, OCaml, transactions Leave a comment »Baardskeerder is a simple embedded database based around an append-only B-treeish datastructure.
It’s a dictionary that also supports range queries and transactions.
Baardskeerder is implemented in ocaml, and the main idea is that it will replace Tokyo Cabinet in our Arakoon Key-Value Store.
This post will try to will explain our approach regarding transactions.
Preliminaries
Baardskeerder appends slabs to a log. Slabs are list of entries, and entries can be
either values, leaves, indexes, or commits. Suppose we start with an empty log, and add a first update (set “f” “F”).
Baardskeerder builds a slab that looks like this:
Value "F" Leaf ["f", Inner 0] Commit (Inner 1)
This slab gets serialized, and added to a log, which looks like this:
Value "F" Leaf ["f", Outer 0] Commit (Outer 1)
Notice there are Outer and Inner references. Outer ones refer to positions in the Log, while Inner ones refer to the current slab. During serialization, each inner positions needs to be translated into an outer one, while outer positions just can be copied. (In reality, Outer positions are a bit more complex to calculate as the size of entries after serialization influences them, but I’ve abstracted away from that problem here)
When a slab is ready to be serialized and added, it always ends with a Commit node. This commit node is necessary as there might be a calamity while writing out the serialized slab and it might not end up entirely in the log. So after a successful update, the last entry in the log is a Commit pointing to the current root node.
To fully appreciate the difference between slab and log references,
let’s add another update (set “d” “D”).
The new slab looks like this:
Value "D" Leaf ["d", Inner 0; "f", Outer 0] Commit (Inner 1)
The new leaf in the slab refers both to the value at position 0 in the log (Value “F”) and to the value at position 0 in the slab (Value “D”).
After serialization of the slab, and writing it to the log, the log looks like this:
Value "F" Leaf ["f", Outer 0] Commit (Outer 1) Value "D" Leaf ["d", Outer 3; "f", Outer 0] Commit (Outer 4)
Let’s add some more updates to have a more interesting log:
set "h" "H"; set "a" "A"; set "z" "Z"
After which, the log looks like this:
Value "F" Leaf ["f", Outer 0] Commit (Outer 1) Value "D" Leaf ["d", Outer 3; "f", Outer 0] Commit (Outer 4) Value "H" Leaf ["d", Outer 3; "f", Outer 0; "h", Outer 6] Commit (Outer 7) Value "A" Leaf ["a", Outer 9; "d", Outer 3] Leaf ["f", Outer 0; "h", Outer 6] Index Outer 10, ["d", Outer 11] Commit (Outer 12) Value "Z" Leaf ["f", Outer 0; "h", Outer 6; "z", Outer 14] Index Outer 10, ["d", Outer 15] Commit (Outer 16)
This corresponds to the following tree:
There are several little things to notice here. First, leaves that overflow (size = 4) are split into to new leaves with an Index entry to refer to them both. Also, the entries in the log only refer to previous entries, and, most importantly, nothing ever changes inside the log.
Transactions
Suppose by some accident the process dies while writing the slab, the last slab doesn’t quite fully make it to the log. In the above example, the log might end up just after the value “Z” (or a bit further with half a leaf written).
At the next opening of the log, we’ll search for the last commit, and the log gets truncated just after position 13. As such, the presence of Commit entries that make updates atomic.
They also provide a means to implement non-concurrent transactions. In fact, if we only have a commit entry at the end of a slab, the slab literally is the transaction. At the beginning of the transaction, we create a slab, and we pass it along for all updates. A commit entry is written at the end of the slab, when we’re done. The only added complexity comes from descending the tree:
We must start at the root in the slab (or log if it’s empty) and if we have to jump to an entry, we have to look at its position and jump into the slab if it’s an inner position, and into the log if it’s an outer one. (How to descend a tree is described on more detail a previous blog post)
Our transactional API is simple and clean:
module DBX :
sig
type tx
val get : tx -> k -> v
val set : tx -> k -> v -> unit
val delete : tx -> k -> unit
...
val with_tx : log -> (tx -> unit) -> unit
end
The following code fragment illustrates our transaction api:
let xs = ["f","F";
"d","D";
"h","H";
"a","A";
"z","Z";
]
in
DBX.with_tx log
(fun tx ->
List.iter (fun (k,v) ->
DBX.set tx k v) xs
)
and suppose we apply this transaction to an empty log, we then end up with the following:
Value "F" Leaf ["f", Outer 0] Value "D" Leaf ["d", Outer 2; "f", Outer 0] Value "H" Leaf ["d", Outer 2; "f", Outer 0; "h", Outer 4] Value "A" Leaf ["a", Outer 6; "d", Outer 2] Leaf ["f", Outer 0; "h", Outer 4] Index Outer 7, ["d", Outer 8] Value "Z" Leaf ["f", Outer 0; "h", Outer 4; "z", Outer 10] Index Outer 7, ["d", Outer 11] Commit (Outer 12)
Performance Summary: each transaction boils down to 1 slab, and each slab is written using 1 system call (write).
Slab Compaction
Batching updates in transactions uses less space compared to individual updates, but we still have a lot of dead entries. Take for example the log above. Entries 1,3, 5,8, and 9 are not useful in any way. So it would be desirable to prune them from the slab before they even hit the log.
This turns out to be surprisingly easy.
First we have to find the dead entries, which is done with a simple iteration through the slab, starting from the root, and marking every Inner position as known. When you’re done,
the positions that you didn’t mark are garbage.
type t = { mutable es: entry array; mutable nes: int}
let mark slab =
let r = Array.make (slab.nes) false in
let maybe_mark = function
| Outer _ -> ()
| Inner x -> r.(x) <- true
in
let maybe_mark2 (_,p) = maybe_mark p in
let mark _ e =
match e with
| NIL | Value _ -> ()
| Commit p -> maybe_mark p
| Leaf l -> List.iter maybe_mark2 l
| Index (p0,kps) -> let () = maybe_mark p0 in List.iter maybe_mark2 kps
in
let () = iteri_rev slab mark in
let () = r.(slab.nes -1) <- true in
r
After we have marked the dead entries, we iterate through the slab again, from position 0 towards the end, and build a mapping that tells you how to translate an old Inner position to a new Inner position.
let mapping mark =
let s = Array.length mark in
let h = Hashtbl.create s in
let rec loop i o =
if i = s then h
else
let v = mark.(i) in
let i' = i + 1 in
let () = Hashtbl.add h i o in
let o' = if v then o + 1 else o in
loop i' o'
in
loop 0 0
The last phase is the actual rewriting of the slab, without the dead entries. This again is a simple iteration.
let rewrite s s_mark s_map =
let lookup_pos = function
| Outer x -> Outer x
| Inner x -> Inner (Hashtbl.find s_map x)
in
let rewrite_leaf kps = List.map (fun (k,p) -> (k,lookup_pos p)) kps in
let rewrite_index (p0,kps) = (lookup_pos p0 , rewrite_leaf kps) in
let rewrite_commit p = lookup_pos p in
let esa = s.es in
let size = s.nes in
let r = Array.create size NIL in
let rec loop c i =
if i = size
then { es = r; nes = c}
else
begin
let i' = i + 1 in
let a = s_mark.(i) in
if a then
let e = esa.(i) in
let e' = match e with
| Leaf l -> Leaf (rewrite_leaf l)
| Index i -> Index (rewrite_index i)
| Commit p -> Commit (rewrite_commit p)
| Value _
| NIL -> e
in
let () = r.(c) <- e' in
let c' = c + 1 in
loop c' i'
else
loop c i'
end
in
loop 0 0
Gluing it all together:
let compact s = let s_mark = mark s in let s_map = mapping s_mark in rewrite s s_mark s_map
If we take a look at the slab for our example transaction of 5 sets before compaction
Value "F" Leaf ["f", Inner 0] Value "D" Leaf ["d", Inner 2; "f", Inner 0] Value "H" Leaf ["d", Inner 2; "f", Inner 0; "h", Inner 4] Value "A" Leaf ["a", Inner 6; "d", Inner 2] Leaf ["f", Inner 0; "h", Inner 4] Index Inner 7, ["d", Inner 8] Value "Z" Leaf ["f", Inner 0; "h", Inner 4; "z", Inner 10] Index Inner 7, ["d", Inner 11] Commit (Inner 12)
and after compaction:
Value "F" Value "D" Value "H" Value "A" Leaf ["a", Inner 3; "d", Inner 1] Leaf ["f", Inner 0; "h", Inner 2] Value "Z" Leaf ["f", Inner 0; "h", Inner 2; "z", Inner 6] Index Inner 4, ["d", Inner 7] Commit (Inner 8)
We can see it’s worth the effort.
Of course, the bigger the transaction, the bigger the benefit of compaction.
Question to my 2.5 readers: If you’ve seen a paper, blog, article,… describing this, please be so kind to send me a link.
Offline log compaction
Given Baardskeerder has transactions with slab compaction, we can build a simple and efficient offline compaction algorithm for log files. Start from the root, walk over the tree while building big transactions (as big as you want, or as big as you can afford), and add these to a new log file. When you’re done, you have your compacted log.
Online log compaction
Baardskeerder logs can also return garbage to the filesystem while they are in use, but that will be kept for a later post.
have fun,
Romain.
Baardskeerder is Afrikaans for barber, but also refers to an order of creatures called Solifugae.
A more marketing-prone name would have been Mjolnir Mach3 Ultra Turbo Plus Enterprise Edition with Aloe vera.What’s in a name?








