Implementing Information Retrieval Systems
I believe that it's valuable to implement software components that you use from scratch merely to gain a better understanding of the systems you use. I've never implemented an operating system from scratch or designed and implemented a programming language like many of my friends but I'd really like to one day. Part of the problem is that looking at a large, complicated, system from the outside is intimidating.
A couple of years ago I spent many months working with CLucene, the C++ port of the Lucene IR (full text search). Both CLucene and Lucene are huge, complicated and often appear to be buggy. While learning about the Lucenes I did a bunch of reading about full text search theory. It never seemed that complicated, but all of the implementations I came across clearly were.
A basic full text search engine works like this:
- collect all of the words in the input documents
- simplify the words so that similar ones are the same, so I can search for "look" and get "looked", "looking" and "looker"
- organize the words in a special tree data structure called a trie that lets you look up the words in linear time, the leaves of the trie for each word indicate which documents it's present in
- to search for a word you walk the trie and get the list of documents
Yesterday I spent a couple of hours walking (Tel Aviv public transport was failing me) and I worked through most of how to implement this in my head. I also thought about how to implement phrase matching. I'm not really sure how phrase matching works in other systems (like I said, they're too complicated for me to understand easily) but I represent each instance of a word in a document and then link from one instance to the instance of the next word in the document. That way after walking down the trie to find the first word in the phrase I can walk along each word in the phrase, comparing it to the next word in the document.
Last night I implemented it in Python. I called it Tripe and it's less than 250 lines of Python. It doesn't have a stemmer worth talking about and it's pretty space inefficient but it seems to generally work. There's a library file called tripe.py and then some test utilities: tripe_add.py (add documents), tripe_search.py (search the index), tripe_dot.py (visualize the index with Graphviz). Take a look:
% echo "The cat sat on the mat." | ./tripe_add.py test.tripe 1 % ./tripe_dot.py test.tripe|dot -Tpng -o test1.png% echo "The quick brown fox..." | ./tripe_add.py test.tripe 2 % ./tripe_dot.py test.tripe|dot -Tpng -o test2.png
% echo "There is a light that never goes out." | ./tripe_add.py test.tripe 3 % ./tripe_dot.py test.tripe|dot -Tpng -o test3.png
% ./tripe_search.py test.tripe the matched in document 1 at 15 matched in document 1 at 0 matched in document 2 at 0 % ./tripe_search.py test.tripe cat matched in document 1 at 4
I first implemented this as a tree in memory, but I found that implementing persistent storage was really easy too. I've always been hesitant to implement structured binary file formats, and I know that efficient implementation can be really complicated, but a naive implementation like I did was relatively straight-forward.
So now what? There are already a bunch of full-text search engines out there, but they all seem relatively complicated. I wonder if it's worth tidying up what I have, extending it enough to be useful for site search applications and publishing it. In the meantime it's on github.
Flipy, a new Python library for Flickr
In the past day or so I've written a new Python library for Flickr. It came from some frustration using other Python libraries. They're all great, but none of them work quite how I want.
My goal was to have a library that feels like Python and the Flickr API at the same time. I think it's worked out pretty well so far. You can make calls using the standard Flickr API calls as documented on the Flickr site, but the response objects feel like normal Python objects. For example you can do something like this:
from flipy import Flipy flickr = Flipy(MY_API_KEY) me = flickr.people.findByUsername(username='ianloic') me_info = flickr.people.getInfo(user_id=me.nsid) print 'My name is %s. I have %s photos at %s.' % (me_info.realname, me_info.photos.count, me_info.photosurl)
I've put more details about the mapping in the README.
Beyond simple mapping of methods to responses I'm working on decorating certain important response objects such as users and photos with more object oriented methods. For example right now if you have a user object you can call user.photos() and get a iterator for of a user's photos. My code takes care of all of the paging behind the scenes.
Since I haven't implemented authentication or uploading yet so right now it's mostly useful for simple mashup-style applications, but I'll get uploads and authentication complete when I get back from Jordan next week.
Check out the code on github and let me know what you think.
An fnmatch implementation using finite state machines and LLVM
For my amusement (and I guess education) I decided to implement a regular expression language on top of LLVM using a Ken Thompson style finite state machine algorithm. Instead of implementing classic POSIX regular expressions I chose to implement something closer to POSIX fnmatch expressions for a couple of reasons. The fnmatch language is simpler to parse than regular expressions and regexes as they are commonly understood and used are not true regular expressions and can't be expressed as finite state machines.
My last experiment with LLVM and fnmatch was built in C++ but this time I chose Python. I'd been prototyping in Python and after I found the llvm-py module I couldn't bring myself to port it all to C++. I spent several days trying to work out the right incantations of STL to represent the Python structures I'd chosen correctly and efficiently in C++ before just embracing Python as the implementation language.
nfa.py
Following Thompson's technique (as explained by Russ Cox) I first converted the fnmatch rule (based on SUSv3 documentation) to non-deterministic finite automaton (NFA) form. Building the NFA from the fnmatch pattern string is straight-forward. Except for bracket expressions each character in the string becomes one node in the NFA. Each bracket expression becomes one node.
dfa.py
Transforming the NFA to deterministic finite automaton (DFA) form is a little trickier, but Russ Cox's explanation of the technique made it pretty straight-forward. Each DFA node maps to one or more NFA nodes so that for a given input string there is only one DFA node that would be reached.
Russ Cox's documentation doesn't cover character classes (ie: wildcards and bracket expressions) and I found that it was a bit tricky to represent and track these when converting from NFA to DFA form. Eventually I came up with a CharacterSet class that represents a set of matching characters, either by inclusion (tracking which characters are in the set) or exclusion (tracking which characters are not in the set). This was handy for representing bracket expressions but invaluable for storing the transitions between DFA states. The CharacterSet class stores a set of characters and a boolean to remember if it's tracking inclusion or exclusion. I built the set operations I needed on top of that including containment, equality, union, difference and intersection.
The part of NFA to DFA transformation that I found trickiest was determining the set of DFA nodes that would be reached from each of the NFA nodes associated with a DFA node. For each NFA node there are a set of descendants that a particular character set maps to. In the NFA the character sets can overlap, but in a DFA we must make sure all of the character sets are disjoint - so that the state machine is deterministic. Additionally, since each DFA node is associated with multiple NFA nodes we need to work out which set of NFA descendants can be reached by the same set of characters and should be treated as a single DFA node.
I wrote a function distinctCharacterSets that for a set of CharacterSets return a set of disjoint CharacterSets where each input CharacterSet can be expressed as the union of one or more output CharacterSets, the union of the input CharacterSets is equal to the union of the output CharacterSets and there are no empty CharacterSets. On top of that I built a function to turn the list of descendants from all of the NFA nodes associated with a DFA node into DFA descendants.
I didn't see this approach discussed in simple descriptions of the Thompson regular expression method and the implementations I tried reading were optimized beyond clarity but unless I'm missing an obvious alternative I'm sure similar structures and algorithms are used in finite automaton based regular expression engines.
compiler.py
While both the DFA and NFA can be interpreted the whole point of the exercise for me was to compile the expression to native code. I found the llvm-py module, a set of Python bindings for LLVM. It's fairly good but incomplete. I've made some patches and have them up on Launchpad.net.
I generate an LLVM function for each DFA with basic blocks for each state. At each state an LLVM switch operation jumps to the next state, to a block that returns true if the input string ends on a terminal state or to a block that returns false if there is no next state to match the input character. Instead of calling llvm-py's bindings for the LLVM optimizer I chose to call out to the command-line tool. It's easier and it seems fast enough. The generated code can be JITed or compiled statically to native code.
The generated code looks decent. There's definitely room for improvement, for example a * at the end of a pattern shouldn't require us to walk the whole string. The LLVM optimizer doesn't have much chance of catching code like that but it should be easy to catch things like that.
test.py
The implementation works, often. After building a tool to compare the results of Python's built-in fnmatch, the NFA, DFA and LLVM implementations I found that after several compiles there are often problems. The problems manifest as an incorrect result, a segmentation fault or a Python error. I'm not sure if these are manifestations of the same problem and I'm not sure if the problem is a bug in LLVM, llvm-py or a mistake in my use of llvm-py.
So, the theory works nicely, but the implementation leaves something to be desired. Hopefully the failures are easy to track down and easy to overcome. There is also very weird performance, but I'll discuss that in a later post. If you want to take a look at it the source is on github.
Python generator fun
I know Python's iterators and generators aren't that new anymore, but at heart I'm still a Python 1.5 programmer. I've come to iterators and generators in Python from my experience in JavaScript.
This morning I wanted to generate a list of names (for nodes in my NFA/DFA pattern matching code) that looked like: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, AA, AB, AC, AD, AE, AF, AG, AH, AI, AJ, AK, AL, AM, AN, AO, AP, AQ, AR, AS, AT, AU, AV, AW, AX, AY, AZ, BA, BB, BC, BD, etc... A couple of minutes of fiddling I came up with:
def name_generator():
from string import uppercase
from itertools import chain
for n in chain([''], name_generator()):
for c in uppercase:
yield n+c
It think the code is pretty simple, pretty readable, and pretty efficient. It's not rocket science, but I feel that iterators and generators have paid off well for Python.





