Tracing block device write request sizes in Linux using SystemTap

Recently some people started to wonder how Arakoon, our distributed key-value store, handles the drive on which data is stored. To be more precise, this boils down to how Tokyo Cabinet (which we currently use as database implementation) submits write requests to the kernel, and how the kernel processes these later on, after translating the filesystem-level requests into write operations in the block layer.

This is considered important to know due to the usage of SSD drives (think wear leveling and other issues coming with these devices).

Tracing how the application writes data to a file can be achieved using the strace tool, looking for write(2) and pwrite(2) calls (although this is not necessarily sufficient: the application might be writing to a file using mmap(2) memory mappings).

On most loads, this won’t correspond to how block requests are sent to the actual hardware, though: writes can be buffered, reordered, etc. What we want is a way to gather information of write requests in the kernel block layer.

In Linux, one can use SystemTap (which is somewhat like DTrace on Solaris) to hook into the kernel at runtime and collect statistics, or lots of other information. One writes a script in which one hooks into trace points or function calls, then processes the incoming information.

I won’t go into the installation of SystemTap, since this will depend on the distribution you’re running. Do note you might not need root access to execute the scripts (so you shouldn’t either!): on my Fedora 16 system, all I had to do was putting my user in the stapsys, stapusr and stapdev groups (although I’m not sure I really need to be in all of those…). You’ll need debugging symbol files of your running kernel available, among other things (in my experience attempting to run a script on a system where some prerequisites are missing will result in the stap tool telling you what you need).

Writing SystemTap scripts sometimes requires some knowledge of kernel internals, to know where to hook in. Since I wanted to gather some statistics about block-layer calls, I knew I had to look for calls in the ‘bio’ subsystem, short for “Block IO”. Luckily SystemTap comes with some predefined combinations of hooks, called ‘tapsets’. These are stored (on my system at least) in /usr/share/systemtap/tapset. One of these files is called ioblock.stp, which sounds interesting. Since I was too lazy to look for documentation, I opened the file in a pager and read through it. The ioblock.request tap looked interesting, which provides access to a size value. After reading some example SystemTap scripts, here’s what I came up with:

global writes

probe ioblock.request {
    if(bio_rw_num(rw) == BIO_WRITE)
        writes[devname] <<< size
}

probe end {
    printf("\n")
    foreach([devname] in writes-) {
        printf("Device: %s\n", devname)
        println(@hist_log(writes[devname]))
    }
}

Running this using the stap tool, and closing it after a while (using CTRL-C) gives the following output:

$ stap -v blockio.stp
Pass 1: parsed user script and 92 library script(s) using 210196virt/30852res/3136shr kb, in 140usr/20sys/251real ms.
Pass 2: analyzed script: 3 probe(s), 21 function(s), 2 embed(s), 2 global(s) using 452960virt/226472res/92556shr kb, in 1500usr/100sys/2541real ms.
Pass 3: translated to C into "/tmp/stapeJldxD/stap_e40cc31c16b7dd290dabd43749c577a4_12507_src.c" using 452960virt/226600res/92684shr kb, in 10usr/0sys/12real ms.
Pass 4: compiled C into "stap_e40cc31c16b7dd290dabd43749c577a4_12507.ko" in 1780usr/350sys/2595real ms.
Pass 5: starting run.
^C
Device: sda5
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@                                    30
    1 |                                                    0
    2 |                                                    0
      ~
  256 |                                                    0
  512 |                                                    0
 1024 |                                                    1
 2048 |                                                    0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                        54
 8192 |@@@                                                 7
16384 |@                                                   3
32768 |                                                    0
65536 |                                                    0

Device: dm-2
value |-------------------------------------------------- count
    0 |@@                                                  5
    1 |                                                    0
    2 |                                                    0
      ~
  256 |                                                    0
  512 |                                                    0
 1024 |@                                                   2
 2048 |                                                    0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      58
 8192 |@@@@@@@                                            14
16384 |@@@                                                 6
32768 |                                                    0
65536 |                                                    0

Device: dm-4
value |-------------------------------------------------- count
    0 |@                                                   2
    1 |                                                    0
    2 |                                                    0
      ~
  256 |                                                    0
  512 |                                                    0
 1024 |@                                                   2
 2048 |                                                    0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      58
 8192 |@@@@@@@                                            14
16384 |@@@                                                 6
32768 |                                                    0
65536 |                                                    0

Device: dm-1
value |-------------------------------------------------- count
 1024 |                                                    0
 2048 |                                                    0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 50
 8192 |                                                    0
16384 |                                                    0

Device: sda
value |-------------------------------------------------- count
 1024 |                                                   0
 2048 |                                                   0
 4096 |@@                                                 2
 8192 |                                                   0
16384 |                                                   0

Pass 5: run completed in 10usr/50sys/15274real ms.

This is a histogram of bio write requests to all block devices on my system while the script was running. If you’d see lots of requests smaller than the page size on an SSD device (this depends on the hardware), you might want to look into the write patterns of your application (although even then, things might not be as bad as they look like on first sight, since the SSD drive could perform batching, remapping and whatever else internally).

Do note the script output displayed above is not the behavior of Arakoon or Tokyo Cabinet: it depicts a rather standard desktop session (running Google Chrome, Evolution, Empathy, Skype and some more applications) during a very short timespan. The test was executed on a Fedora 16 laptop system, containing a spinning hard drive, using a combination of Ext4 and XFS on several LVM2 volumes on kernel 3.2.7-1.fc16.x86_64.

About these ads


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.