Implementing Information Retrieval Systems

I believe that it’s valuable to implement software components that you use from scratch merely to gain a better understanding of the systems you use. I’ve never implemented an operating system from scratch or designed and implemented a programming language like many of my friends but I’d really like to one day. Part of the problem is that looking at a large, complicated, system from the outside is intimidating.

A couple of years ago I spent many months working with CLucene, the C++ port of the Lucene IR (full text search). Both CLucene and Lucene are huge, complicated and often appear to be buggy. While learning about the Lucenes I did a bunch of reading about full text search theory. It never seemed that complicated, but all of the implementations I came across clearly were.

A basic full text search engine works like this:

  • collect all of the words in the input documents
  • simplify the words so that similar ones are the same, so I can search for “look” and get “looked”, “looking” and “looker”
  • organize the words in a special tree data structure called a trie that lets you look up the words in linear time, the leaves of the trie for each word indicate which documents it’s present in
  • to search for a word you walk the trie and get the list of documents

Yesterday I spent a couple of hours walking (Tel Aviv public transport was failing me) and I worked through most of how to implement this in my head. I also thought about how to implement phrase matching. I’m not really sure how phrase matching works in other systems (like I said, they’re too complicated for me to understand easily) but I represent each instance of a word in a document and then link from one instance to the instance of the next word in the document. That way after walking down the trie to find the first word in the phrase I can walk along each word in the phrase, comparing it to the next word in the document.

Last night I implemented it in Python. I called it Tripe and it’s less than 250 lines of Python. It doesn’t have a stemmer worth talking about and it’s pretty space inefficient but it seems to generally work. There’s a library file called tripe.py and then some test utilities: tripe_add.py (add documents), tripe_search.py (search the index), tripe_dot.py (visualize the index with Graphviz). Take a look:

% echo "The cat sat on the mat." | ./tripe_add.py test.tripe 1
% ./tripe_dot.py test.tripe|dot -Tpng -o test1.png

% echo "The quick brown fox..." | ./tripe_add.py test.tripe 2
% ./tripe_dot.py test.tripe|dot -Tpng -o test2.png

% echo "There is a light that never goes out." | ./tripe_add.py test.tripe 3
% ./tripe_dot.py test.tripe|dot -Tpng -o test3.png

% ./tripe_search.py test.tripe the
matched in document 1 at 15
matched in document 1 at 0
matched in document 2 at 0
% ./tripe_search.py test.tripe cat
matched in document 1 at 4

I first implemented this as a tree in memory, but I found that implementing persistent storage was really easy too. I’ve always been hesitant to implement structured binary file formats, and I know that efficient implementation can be really complicated, but a naive implementation like I did was relatively straight-forward.

So now what? There are already a bunch of full-text search engines out there, but they all seem relatively complicated. I wonder if it’s worth tidying up what I have, extending it enough to be useful for site search applications and publishing it. In the meantime it’s on github.

7 replies on “Implementing Information Retrieval Systems”

  1. You should see the really really full text search I did for the help system on the NIC. No simplified words. No phrases. No trie. Still found relevant help topics way better than Windows or Linux. Wait, there’s help topics on Linux?

  2. Complication is an unfortunate consequence of time. It’s much harder to keep something simple than to let it grow.

  3. By the time you let your thoughts roll, and sat to code this in Python, you could have just read Lucene In Action and understand the concepts behind Lucene, and how it’s classes are implementing them. Understanding Lucene is also easier by reading the Java Lucene sources than CLucene’s.

    Lucene is not all the complicated, really. It is very well-structured, and ready for enterprise-level usage (can you allow for multi-searchers / indexers, or distributed 4GB index with your code?).

    IMHO, instead of reinventing the wheel, join existing developers and help their efforts in making it roll faster and more efficiently. If you don’t like reading up docs, call me next time you’re in Jerusalem 🙂

    1. I have read Lucene In Action (not cover to cover, but a lot of it), and I have looked through the Lucene sources (often to try to understand CLucene better), but I feel like implementing something yourself has some value – at least as a learning exercise.

      Perhaps Lucene isn’t complicated, but it is large. The compressed tar file is 12MB. That’s daunting when you’re trying to learn, and even when you’re trying to debug why the hell your indexes keep getting corrupted (a recurring problem with CLucene that I never resolved during my tenure at Flock). In some ways large is worse than complicated because it just becomes hard to keep things straight.

      Anyway, my intention isn’t to reinvent the wheel. It’s ultimately to understand the problem space better. I wasn’t interested in building something “enterprise-level” because I don’t have an enterprise to serve right now.

      What I found interesting while building my 250 line inverted index was that I could build application specific search systems rather than trying to customize general purpose ones like Lucene. A lot of the complexity in a system like Lucene is all of the support for specific use cases. It’s necessary for a general purpose library like Lucene to support these specific use cases because every real world system will involve the general inverted index plus something specific to be useful. If the general system can be implemented relatively easily, perhaps we should just be implementing specific full text search systems for specific applications rather than using Lucene’s various building blocks. I know this is counter to the computer science mantras of abstraction and reuse, but I think that those are often applied to readily. Who knows. I know that next time I need to build a search system for a real application I’ll try Lucene and a few of its friends before building my own 🙂

      I don’t know when I’ll make it back to Jerusalem. My two months in Israel ends on Tuesday when I fly to Ghana. I was visiting Jerusalem from Tel Aviv every week, but I think it’ll be a couple of years at least before I’m back for a visit. We were in Abu Ghosh and Latrun today, I can’t imagine how cold it’s gotten up in Jerusalem now! Brrr!

  4. Writing your own implementation of anything hardcore is a great learning exercise, I couldn’t agree more. I hope I had the time to do anything of the sort myself. What you were implying in your original post, and again in your reply, is that code you wrote from scratch (or will write wherever the need arise) may be better for some real-world scenarios. This is the point I strongly object.

    The Lucene index is actually very generic, and most of the code is meant for dealing with this generality, to provide you with the ability to use the library in a very broad set of usages. The library size you were complaining about, is actually Lucene’s strongest points. It has a record of 7+ years (CLucene has 7, JLucene even more) where many people have tried it under different circumstances, with different hardware and for different use-cases. They have both proof-tested it, and provided fixes or extensions. No new library or code written from scratch can match this.

    But, it is not just about stability. And not even about re-using code. I think in the open-source world, there’s a great value for joining forces and working on something together; re-writing something from scratch, for internal use or releasing it to the open, is something one should do only if there’s a compelling reason to do so. There are mainly two reasons for that – you and others. You don’t have to write all the code for yourself and can keep focused on your own business logic, relying on others to do this work for you, and along the way you help improve the original code base and by that you help others.
    I’m saying all this, because I don’t recall seeing any report regarding a corrupted index in CLucene. My memory may be fooling me, or I may be new to the project, but as it appears it is quite common for developers using open-source projects not to provide feedback or own patch work to the original project or developers. I think we all are losing lots of good stuff.

    You are right about the learning curve for all non-simple usage, although it is not as steep as it looks. Developing tools (Analyzer, Filter, Scorer etc) for a very customized search pattern is indeed not a task one will do with only basic understanding of the code, but is not too hard a task to learn. If there’s something I learned from your post, is how important in-depth documentation, articles and tutorials are. Hopefully we’ll get the time to write many of them soon; right now we are focused on making some really cool code improvements.

    You definitely had a very cold epilogue to your journey. It snowed last night up north and in the Jerusalem area. In Israel, that happens like once or twice a year. A real celebration…

    1. I definitely understand the advantages of a powerful, well tested platform like [C]Lucene, but many search applications I’ve seen don’t need what it has to offer. I’m becoming less and less convinced in one-size-fits-all solutions as the universal answer, especially for applications that only require a subset of the functionality offered by comprehensive packages like Lucene.

      As for our CLucene index corruption issues, as far as I remember we talked to developers on IRC and to Ben in person. We were never able to reproduce the issues in a controlled environment and when we did get corrupted indexes (for example by sending users external hard disks to copy their corrupt indexes to) we couldn’t work out what was wrong.

Comments are closed.