Cloud Search Using Suffix Arrays ? Well, … maybe.

Suffix Arrays are arrays that allow you to find exact substring matches. The core idea is that you generate a sorted array of positions using a comparison function that compares the suffixes starting at the respective positions.

Constructing a suffix array

It’s one of the cases where a few lines of code is more clarifying than the explanation itself.

def make_suffix_array(text):
    size = len(text)
    def compare_suffix(i,j):
        return cmp(text[i:],text[j:])
    
    indexes = range(size)
    indexes.sort(cmp = compare_suffix)
    return indexes

If you try this on the string “Suffix Arrays Rock?” and print out the sorted array’s indexes along with the suffixes that start there, you start to see both potential, as well as weaknesses.

pos suffix
06 ‘ Arrays Rock?’
13 ‘ Rock?’
18 ‘?’
07 ‘Arrays Rock?’
14 ‘Rock?’
00 ‘Suffix Arrays Rock?’
10 ‘ays Rock?’
16 ‘ck?’
02 ‘ffix Arrays Rock?’
03 ‘fix Arrays Rock?’
04 ‘ix Arrays Rock?’
17 ‘k?’
15 ‘ock?’
09 ‘rays Rock?’
08 ‘rrays Rock?’
12 ‘s Rock?’
01 ‘uffix Arrays Rock?’
05 ‘x Arrays Rock?’
11 ‘ys Rock?’

Note: This method of creating suffix arrays is criminally inefficient, but it suffices for explaining the concept. There are algorithms to do this in linear time.

You can use the suffix array to search for a string in the text using binary search. Something like this:

def find(sa, text, q):
    size = len(sa)
    qsize = len(q)
    hi = size -1
    lo = 0
    while hi >= lo:
        mid = (hi + lo) / 2
        begin = sa[mid]
        end = min(size, begin + qsize)
        test = text[begin: end]
        if test > q:
            hi = mid -1
        elif test < q:
            lo = mid + 1
        else:
            return begin
    return None

You have a fast search that allows you to return the exact position of the substring.
Even better: all matches are clustered together in the suffix array. Perfect locality.

Multiple documents

The concept easily extendeds itself to multiple documents.

def make_multi(texts):
    indexes = []
    nt = len(texts)
    for d in xrange(nt): 
        size = len(texts[d])
        for i in xrange(size):
            e = d,i
            indexes.append(e)

    def compare(e0,e1):
        d0, i0 = e0
        d1, i1 = e1
        s0 = texts[d0][i0:]
        s1 = texts[d1][i1:]
        return cmp(s0,s1)

    indexes.sort(cmp = compare)
    return indexes

A minimal example with 3 very small texts, and a bit of code to dump the suffix array, together with the suffixes.

def print_multi(sam, texts):
    for e in sam:
        d, i = e
        suffix = texts[d][i:]
        print "(%2i,%2i)\t%s" % (d,i,suffix)

texts = ["Suffix Arrays Rock?",
         "Redundant Array of Inexpensive Disks",
         "Metal is not Rock!"]
r0 = make_multi(texts)
print_multi(r0, texts)

yields:

( 1, 9)	 Array of Inexpensive Disks
( 0, 6)	 Arrays Rock?
( 1,30)	 Disks
( 1,18)	 Inexpensive Disks
( 2,12)	 Rock!
( 0,13)	 Rock?
( 2, 5)	 is not Rock!
( 2, 8)	 not Rock!
( 1,15)	 of Inexpensive Disks
...
( 1,28)	ve Disks
( 0, 5)	x Arrays Rock?
( 1,22)	xpensive Disks
( 1,14)	y of Inexpensive Disks
( 0,11)	ys Rock?

Ok, this seems to have almost perfect scalability properties. What’s the catch?

The catch

There are some severe downsides to Suffix Arrays.

  • Text needed during search
    A major disadvantage to suffix arrays is that you need access to the text while you do a search. It means that the suffix array and the text need to be kept close to each other.
  • Storage Needs
    If you think of it, a text is an array of chars (ascii, not unicode). Its suffix array is an array of positions. Suppose you store a position and document ids as a 32bit words. Remember you need to store the text, as well as the suffix array. If the text size is n, the total needs are 9n. To make matters worse: since you’ll be jumping around in the suffix array, as well as the text, compression schemes are not really feasible.
  • Updates?
    Even the slightest update to the text will invalidate the suffix array, and you need to reconstruct it. Even worse, if you have an array over multiple documents, an update in one of the documents will invalidate the entire structure.
  • Exact Matches
    You get what you asked for. Searching the example text for ‘suffix’ will not yield a result.
  • Overkill
    You can search for an exact match on a string of arbitrary length, but nobody needs to find exact matches on large strings. There is one notable exception: If you’re trying to look up DNA fragments in a large DNA string, this is exactly what you need. For everybody else, it’s overkill.

Getting better all the time

Well, it couldn’t get much worse now could it?
It doesn’t need to be this bad, and you can actually pick how much pain you want to inflict on yourself.

  • search starts at start of words
    Most of the time, it’s not needed to be able to search for substrings of words. For example, you probably don’t need to find the ‘rays’ in ‘arrays’. It divides the number of positions to sort by the average word size.
  • remove language noise
  • Not all words need to be considered. In fact, the most frequent words are noise.

  • old versions can be blacklisted
    If all versions of all your documents have a unique id, you can blacklist older unwanted document ids, and filter them out before you return the results. This means you don’t have to do updates immediately, but you can schedule them when it’s cheapest for you.
  • text size is not document size
    Suppose your documents are pdfs. The actual size of the text is small compared to all the other information in the file (markup, images, fonts, …). Moreover, storing text will not eat your storage. For example, a heavyweight (in every way) novel like War And Peace takes about 3.2 MB. The entire corpus of english wikipedia articles, which is stored as xml is just over 30 GB. peanuts.
  • multiple suffix arrays
    You don’t need add all your documents to one big suffix array. You can have many of them wich improves the situation in more than one way. Search can be parallel, and adding a new document is cheaper as well.

What does this have to do with ‘The Cloud’ ?

Since there is a network between the client issuing the request and the cloud infrastructure offering the search service, means you have a non-neglectable latency.
This means the latency added by the service just needs to be small compared to the overall latency, which makes our life a bit easier.

Also, the nature of ‘The Cloud’ pushes people to store consolidated instead of live data, which is exactly what is best suited for suffix arrays.
Another prerequisite for suffix arrays is the availability of the text during search, which is exactly what happens in a cloud setup.

Closing words

Suffix arrays are an idea that was conceived back when ‘online’ still had other semantics. At the time (1991) they were seen as too expensive for general search, but that was back when ‘big hard disks’ had less than 100MB of capacity. Maybe Suffix Arrays deserve a second look.

I can feel the urge to whip up an email search/indexing service in a few hundred lines of code, or a private wikipedia search service.
Just one question remains: what shall it be? ocaml or haskell? hm, tough one.

have fun,

Romain.

About these ads

2 Comments on “Cloud Search Using Suffix Arrays ? Well, … maybe.”

  1. fcicq says:

    PFI, a tech startup company in Japan(preferred.jp) have a commercial Search Engine implementation based on Compressed Suffix Array. A demo for searching entire wikipedia by only one machine with SSD is available. SA is suitable for medium scale(~100GB to several TB), but not web scale.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.