I believe that it's valuable to implement software components that you use from scratch merely to gain a better understanding of the systems you use. I've never implemented an operating system from scratch or designed and implemented a programming language like many of my friends but I'd really like to one day. Part of the problem is that looking at a large, complicated, system from the outside is intimidating.
A couple of years ago I spent many months working with CLucene, the C++ port of the Lucene IR (full text search). Both CLucene and Lucene are huge, complicated and often appear to be buggy. While learning about the Lucenes I did a bunch of reading about full text search theory. It never seemed that complicated, but all of the implementations I came across clearly were.
A basic full text search engine works like this:
Last night I implemented it in Python. I called it Tripe and it's less than 250 lines of Python. It doesn't have a stemmer worth talking about and it's pretty space inefficient but it seems to generally work. There's a library file called tripe.py and then some test utilities: tripe_add.py (add documents), tripe_search.py (search the index), tripe_dot.py (visualize the index with Graphviz). Take a look:
% echo "The cat sat on the mat." | ./tripe_add.py test.tripe 1 % ./tripe_dot.py test.tripe|dot -Tpng -o test1.png % echo "The quick brown fox..." | ./tripe_add.py test.tripe 2 % ./tripe_dot.py test.tripe|dot -Tpng -o test2.png % echo "There is a light that never goes out." | ./tripe_add.py test.tripe 3 % ./tripe_dot.py test.tripe|dot -Tpng -o test3.png % ./tripe_search.py test.tripe the matched in document 1 at 15 matched in document 1 at 0 matched in document 2 at 0 % ./tripe_search.py test.tripe cat matched in document 1 at 4I first implemented this as a tree in memory, but I found that implementing persistent storage was really easy too. I've always been hesitant to implement structured binary file formats, and I know that efficient implementation can be really complicated, but a naive implementation like I did was relatively straight-forward.
So now what? There are already a bunch of full-text search engines out there, but they all seem relatively complicated. I wonder if it's worth tidying up what I have, extending it enough to be useful for site search applications and publishing it. In the meantime it's on github.