<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Implementing Information Retrieval Systems</title>
	<atom:link href="http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/feed/" rel="self" type="application/rss+xml" />
	<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/</link>
	<description>from Ian McKellar</description>
	<lastBuildDate>Fri, 11 Nov 2011 12:15:31 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Ian McKellar</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1838</link>
		<dc:creator>Ian McKellar</dc:creator>
		<pubDate>Sun, 07 Feb 2010 14:42:58 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1838</guid>
		<description>I definitely understand the advantages of a powerful, well-tested platform like [C]Lucene, but many search applications I&#039;ve seen don&#039;t need what it has to offer. I&#039;m becoming less and less convinced by one-size-fits-all solutions as the universal answer, especially for applications that only require a subset of the functionality offered by comprehensive packages like Lucene.

As for our CLucene index corruption issues, as far as I remember we talked to developers on IRC and to Ben in person. We were never able to reproduce the issues in a controlled environment and when we did get corrupted indexes (for example by sending users external hard disks to copy their corrupt indexes to) we couldn&#039;t work out what was wrong.</description>
		<content:encoded><![CDATA[<p>I definitely understand the advantages of a powerful, well-tested platform like [C]Lucene, but many search applications I&#8217;ve seen don&#8217;t need what it has to offer. I&#8217;m becoming less and less convinced by one-size-fits-all solutions as the universal answer, especially for applications that only require a subset of the functionality offered by comprehensive packages like Lucene.</p>
<p>As for our CLucene index corruption issues, as far as I remember we talked to developers on IRC and to Ben in person. We were never able to reproduce the issues in a controlled environment and when we did get corrupted indexes (for example by sending users external hard disks to copy their corrupt indexes to) we couldn&#8217;t work out what was wrong.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Itamar Syn-Hershko</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1826</link>
		<dc:creator>Itamar Syn-Hershko</dc:creator>
		<pubDate>Sun, 07 Feb 2010 00:04:11 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1826</guid>
		<description>Writing your own implementation of anything hardcore is a great learning exercise; I couldn&#039;t agree more. I wish I had the time to do anything of the sort myself. What you were implying in your original post, and again in your reply, is that code you wrote from scratch (or will write wherever the need arises) may be better for some real-world scenarios. This is the point I strongly object to.

The Lucene index is actually very generic, and most of the code is meant for dealing with this generality, to provide you with the ability to use the library in a very broad set of uses. The library size you were complaining about is actually one of Lucene&#039;s strongest points. It has a record of 7+ years (CLucene has 7, JLucene even more) where many people have tried it under different circumstances, with different hardware and for different use-cases. They have both proof-tested it and provided fixes or extensions. No new library or code written from scratch can match this.

But it is not just about stability, and not even about re-using code. I think in the open-source world there&#039;s great value in joining forces and working on something together; re-writing something from scratch, for internal use or for releasing it to the open, is something one should do only if there&#039;s a compelling reason to do so. There are mainly two reasons to join forces - you and others. You don&#039;t have to write all the code yourself and can keep focused on your own business logic, relying on others to do this work for you, and along the way you help improve the original code base and by that you help others.
I&#039;m saying all this because I don&#039;t recall seeing any report regarding a corrupted index in CLucene. My memory may be fooling me, or I may be new to the project, but it appears to be quite common for developers using open-source projects not to provide feedback or their own patches to the original project or developers. I think we are all losing lots of good stuff.

You are right about the learning curve for all non-simple usage, although it is not as steep as it looks. Developing tools (Analyzer, Filter, Scorer etc) for a very customized search pattern is indeed not a task one will do with only a basic understanding of the code, but it is not too hard a task to learn. If there&#039;s something I learned from your post, it is how important in-depth documentation, articles and tutorials are. Hopefully we&#039;ll get the time to write many of them soon; right now we are focused on making some really cool code improvements.

You definitely had a very cold epilogue to your journey. It snowed last night up north and in the Jerusalem area. In Israel, that happens like once or twice a year. A real celebration...</description>
		<content:encoded><![CDATA[<p>Writing your own implementation of anything hardcore is a great learning exercise; I couldn&#8217;t agree more. I wish I had the time to do anything of the sort myself. What you were implying in your original post, and again in your reply, is that code you wrote from scratch (or will write wherever the need arises) may be better for some real-world scenarios. This is the point I strongly object to.</p>
<p>The Lucene index is actually very generic, and most of the code is meant for dealing with this generality, to provide you with the ability to use the library in a very broad set of uses. The library size you were complaining about is actually one of Lucene&#8217;s strongest points. It has a record of 7+ years (CLucene has 7, JLucene even more) where many people have tried it under different circumstances, with different hardware and for different use-cases. They have both proof-tested it and provided fixes or extensions. No new library or code written from scratch can match this.</p>
<p>But it is not just about stability, and not even about re-using code. I think in the open-source world there&#8217;s great value in joining forces and working on something together; re-writing something from scratch, for internal use or for releasing it to the open, is something one should do only if there&#8217;s a compelling reason to do so. There are mainly two reasons to join forces &#8211; you and others. You don&#8217;t have to write all the code yourself and can keep focused on your own business logic, relying on others to do this work for you, and along the way you help improve the original code base and by that you help others.<br />
I&#8217;m saying all this because I don&#8217;t recall seeing any report regarding a corrupted index in CLucene. My memory may be fooling me, or I may be new to the project, but it appears to be quite common for developers using open-source projects not to provide feedback or their own patches to the original project or developers. I think we are all losing lots of good stuff.</p>
<p>You are right about the learning curve for all non-simple usage, although it is not as steep as it looks. Developing tools (Analyzer, Filter, Scorer etc) for a very customized search pattern is indeed not a task one will do with only a basic understanding of the code, but it is not too hard a task to learn. If there&#8217;s something I learned from your post, it is how important in-depth documentation, articles and tutorials are. Hopefully we&#8217;ll get the time to write many of them soon; right now we are focused on making some really cool code improvements.</p>
<p>You definitely had a very cold epilogue to your journey. It snowed last night up north and in the Jerusalem area. In Israel, that happens like once or twice a year. A real celebration&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian McKellar</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1825</link>
		<dc:creator>Ian McKellar</dc:creator>
		<pubDate>Sat, 06 Feb 2010 22:29:41 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1825</guid>
		<description>I &lt;em&gt;have&lt;/em&gt; read Lucene In Action (not cover to cover, but a lot of it), and I have looked through the Lucene sources (often to try to understand CLucene better), but I feel like implementing something yourself has some value - at least as a learning exercise.

Perhaps Lucene isn&#039;t complicated, but it is large. The compressed tar file is 12MB. That&#039;s daunting when you&#039;re trying to learn, and even when you&#039;re trying to debug why the hell your indexes keep getting corrupted (a recurring problem with CLucene that I never resolved during my tenure at Flock). In some ways large is worse than complicated because it just becomes hard to keep things straight.

Anyway, my intention isn&#039;t to reinvent the wheel. It&#039;s ultimately to understand the problem space better. I wasn&#039;t interested in building something &quot;enterprise-level&quot; because I don&#039;t have an enterprise to serve right now.

What I found interesting while building my 250-line inverted index was that I could build application-specific search systems rather than trying to customize general-purpose ones like Lucene. A lot of the complexity in a system like Lucene comes from all of the support for specific use cases. It&#039;s necessary for a general-purpose library like Lucene to support these specific use cases because every real-world system will involve the general inverted index plus something specific to be useful. If the general system can be implemented relatively easily, perhaps we should just be implementing specific full-text search systems for specific applications rather than using Lucene&#039;s various building blocks. I know this is counter to the computer science mantras of abstraction and reuse, but I think that those are often applied too readily. Who knows. I know that next time I need to build a search system for a real application I&#039;ll try Lucene and a few of its friends before building my own :)

I don&#039;t know when I&#039;ll make it back to Jerusalem. My two months in Israel end on Tuesday when I fly to Ghana. I was visiting Jerusalem from Tel Aviv every week, but I think it&#039;ll be a couple of years at least before I&#039;m back for a visit. We were in Abu Ghosh and Latrun today; I can&#039;t imagine how cold it&#039;s gotten up in Jerusalem now! Brrr!</description>
		<content:encoded><![CDATA[<p>I <em>have</em> read Lucene In Action (not cover to cover, but a lot of it), and I have looked through the Lucene sources (often to try to understand CLucene better), but I feel like implementing something yourself has some value &#8211; at least as a learning exercise.</p>
<p>Perhaps Lucene isn&#8217;t complicated, but it is large. The compressed tar file is 12MB. That&#8217;s daunting when you&#8217;re trying to learn, and even when you&#8217;re trying to debug why the hell your indexes keep getting corrupted (a recurring problem with CLucene that I never resolved during my tenure at Flock). In some ways large is worse than complicated because it just becomes hard to keep things straight.</p>
<p>Anyway, my intention isn&#8217;t to reinvent the wheel. It&#8217;s ultimately to understand the problem space better. I wasn&#8217;t interested in building something &#8220;enterprise-level&#8221; because I don&#8217;t have an enterprise to serve right now.</p>
<p>What I found interesting while building my 250-line inverted index was that I could build application-specific search systems rather than trying to customize general-purpose ones like Lucene. A lot of the complexity in a system like Lucene comes from all of the support for specific use cases. It&#8217;s necessary for a general-purpose library like Lucene to support these specific use cases because every real-world system will involve the general inverted index plus something specific to be useful. If the general system can be implemented relatively easily, perhaps we should just be implementing specific full-text search systems for specific applications rather than using Lucene&#8217;s various building blocks. I know this is counter to the computer science mantras of abstraction and reuse, but I think that those are often applied too readily. Who knows. I know that next time I need to build a search system for a real application I&#8217;ll try Lucene and a few of its friends before building my own <img src='http://ianloic.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>I don&#8217;t know when I&#8217;ll make it back to Jerusalem. My two months in Israel end on Tuesday when I fly to Ghana. I was visiting Jerusalem from Tel Aviv every week, but I think it&#8217;ll be a couple of years at least before I&#8217;m back for a visit. We were in Abu Ghosh and Latrun today; I can&#8217;t imagine how cold it&#8217;s gotten up in Jerusalem now! Brrr!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Itamar Syn-Hershko</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1823</link>
		<dc:creator>Itamar Syn-Hershko</dc:creator>
		<pubDate>Sat, 06 Feb 2010 18:43:58 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1823</guid>
		<description>In the time it took you to let your thoughts roll and sit down to code this in Python, you could have just read Lucene In Action and understood the concepts behind Lucene and how its classes implement them. Lucene is also easier to understand by reading the Java Lucene sources than CLucene&#039;s.

Lucene is not all that complicated, really. It is very well-structured, and ready for enterprise-level usage (can you allow for multiple searchers / indexers, or a distributed 4GB index, with your code?).

IMHO, instead of reinventing the wheel, join existing developers and help their efforts in making it roll faster and more efficiently. If you don&#039;t like reading docs, call me next time you&#039;re in Jerusalem :)</description>
		<content:encoded><![CDATA[<p>In the time it took you to let your thoughts roll and sit down to code this in Python, you could have just read Lucene In Action and understood the concepts behind Lucene and how its classes implement them. Lucene is also easier to understand by reading the Java Lucene sources than CLucene&#8217;s.</p>
<p>Lucene is not all that complicated, really. It is very well-structured, and ready for enterprise-level usage (can you allow for multiple searchers / indexers, or a distributed 4GB index, with your code?).</p>
<p>IMHO, instead of reinventing the wheel, join existing developers and help their efforts in making it roll faster and more efficiently. If you don&#8217;t like reading docs, call me next time you&#8217;re in Jerusalem <img src='http://ianloic.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: pvh</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1821</link>
		<dc:creator>pvh</dc:creator>
		<pubDate>Fri, 05 Feb 2010 18:53:41 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1821</guid>
		<description>Complication is an unfortunate consequence of time. It&#039;s much harder to keep something simple than to let it grow.</description>
		<content:encoded><![CDATA[<p>Complication is an unfortunate consequence of time. It&#8217;s much harder to keep something simple than to let it grow.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian McKellar</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1819</link>
		<dc:creator>Ian McKellar</dc:creator>
		<pubDate>Fri, 05 Feb 2010 12:27:02 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1819</guid>
		<description>Yeah, but can it &lt;em&gt;scale&lt;/em&gt;? :)

That&#039;s when all that silly inverted index shit becomes useful.</description>
		<content:encoded><![CDATA[<p>Yeah, but can it <em>scale</em>? <img src='http://ianloic.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>That&#8217;s when all that silly inverted index shit becomes useful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joel</title>
		<link>http://ianloic.com/2010/02/05/implementing-information-retrieval-systems/comment-page-1/#comment-1818</link>
		<dc:creator>Joel</dc:creator>
		<pubDate>Fri, 05 Feb 2010 12:11:28 +0000</pubDate>
		<guid isPermaLink="false">http://ianloic.com/?p=198#comment-1818</guid>
		<description>You should see the really really full text search I did for the help system on the NIC.  No simplified words.  No phrases.  No trie.  Still found relevant help topics way better than Windows or Linux.  Wait, there&#039;s help topics on Linux?</description>
		<content:encoded><![CDATA[<p>You should see the really really full text search I did for the help system on the NIC.  No simplified words.  No phrases.  No trie.  Still found relevant help topics way better than Windows or Linux.  Wait, there&#8217;s help topics on Linux?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

