Implementing Information Retrieval Systems

I believe that it's valuable to implement software components that you use from scratch merely to gain a better understanding of the systems you use. I've never implemented an operating system from scratch or designed and implemented a programming language like many of my friends but I'd really like to one day. Part of the problem is that looking at a large, complicated, system from the outside is intimidating.

A couple of years ago I spent many months working with CLucene, the C++ port of the Lucene IR (full text search). Both CLucene and Lucene are huge, complicated and often appear to be buggy. While learning about the Lucenes I did a bunch of reading about full text search theory. It never seemed that complicated, but all of the implementations I came across clearly were.

A basic full text search engine works like this:

collect all of the words in the input documents
simplify the words so that similar ones are the same, so I can search for "look" and get "looked", "looking" and "looker"
organize the words in a special tree data structure called a trie that lets you look up the words in linear time, the leaves of the trie for each word indicate which documents it's present in
to search for a word you walk the trie and get the list of documents

Yesterday I spent a couple of hours walking (Tel Aviv public transport was failing me) and I worked through most of how to implement this in my head. I also thought about how to implement phrase matching. I'm not really sure how phrase matching works in other systems (like I said, they're too complicated for me to understand easily) but I represent each instance of a word in a document and then link from one instance to the instance of the next word in the document. That way after walking down the trie to find the first word in the phrase I can walk along each word in the phrase, comparing it to the next word in the document.

Last night I implemented it in Python. I called it Tripe and it's less than 250 lines of Python. It doesn't have a stemmer worth talking about and it's pretty space inefficient but it seems to generally work. There's a library file called tripe.py and then some test utilities: tripe_add.py (add documents), tripe_search.py (search the index), tripe_dot.py (visualize the index with Graphviz). Take a look:

% echo "The cat sat on the mat." | ./tripe_add.py test.tripe 1
% ./tripe_dot.py test.tripe|dot -Tpng -o test1.png

% echo "The quick brown fox..." | ./tripe_add.py test.tripe 2
% ./tripe_dot.py test.tripe|dot -Tpng -o test2.png

% echo "There is a light that never goes out." | ./tripe_add.py test.tripe 3
% ./tripe_dot.py test.tripe|dot -Tpng -o test3.png

% ./tripe_search.py test.tripe the
matched in document 1 at 15
matched in document 1 at 0
matched in document 2 at 0
% ./tripe_search.py test.tripe cat
matched in document 1 at 4

I first implemented this as a tree in memory, but I found that implementing persistent storage was really easy too. I've always been hesitant to implement structured binary file formats, and I know that efficient implementation can be really complicated, but a naive implementation like I did was relatively straight-forward.

So now what? There are already a bunch of full-text search engines out there, but they all seem relatively complicated. I wonder if it's worth tidying up what I have, extending it enough to be useful for site search applications and publishing it. In the meantime it's on github.