Full-Text Search For Relational Data

by
Tags: , ,
Category:

Thanks to internet search engines like Google, Yahoo, and Bing, everyone has become accustomed to searching their data effortlessly. Consequently, as developers of web-based applications, we increasingly find that our customers’ requirements include being able to search their application’s data in a “Google-like” way.

For a very small subset of applications that only need to do full-text search on a few database columns, there a some very simplistic solutions, ranging from using a SQL ‘like’ clause on every column to database-vendor supplied full-text search functionality for designated columns.

Frequently, however, these simplistic approaches are not enough, either because the data to be searched is large and complex, because the approach’s search functionality is limited, or because database neutrality is desired. For these situations, a more advanced solution is needed. Fortunately, there are now a number of solutions available for all of the most popular platforms, languages, and frameworks, and several hosted solutions that are development stack-neutral.

Lucene and Compass

One of the most popular full-text search projects is Apache Lucene, a Java-based, open-source indexing and searching solution. Lucene itself targets indexing and searching of ‘documents’, but several other projects, most notably Compass, have adapted Lucene for indexing and searching relational data via integration with Object-Relational Mapping libraries such as Java’s JPA, Hibernate, and others.

The combination of Lucene and Compass provides Java developers with an easy way to implement Google-like searching of their relational data, including fuzzy searching (spell-check / ‘Did you mean…’), stemming (‘fishing’, ‘fished’, ‘fish’, and ‘fisher’ all match ‘fish’), wildcards, and boolean logic.

For users of the Java JPA ORM, usage of Compass is fairly straightforward. For users of the Grails framework, the Grails Searchable Plugin makes implementation of full-text searching almost effortless. I recently added this plugin to a Grails application with a moderately complex data model, and had my full-text search functionality up and running in a few hours, with fine-tuning taking only a few more hours. As the application’s data model has grown, very little time has had to be spent on keeping the search functionality up to date.

There are many other programming languages and platforms in use besides Java, so not surprisingly, Lucene has been ported to many of them, including C, C++, .NET, Objective-C, Python, Perl, and Ruby. The most popular of these, due to the popularity of the programming languages, are Lucene.net, which is a C# port targeted at the .NET runtime, and Ferret, which is a Ruby port targeted at the Ruby on Rails framework.

While Lucene and its variants currently dominate the open-source full-text search landscape, there are other well-known projects, including Xapian.

Hosted Solutions

There may be situations where it makes more sense move your full-text search functionality ‘to the cloud’. Several hosted solutions are available, some based on Lucene or other open-source technologies, as well as other, proprietary offerings.

Websolr is a hosted version of Apache’s Solr which is itself built on Lucene. Other offerings include Flying Sphinx for Rails (which uses the Sphinx search engine), and IndexTank, which is a proprietary offering.

Both Flying Sphinx and IndexTank are available as add-ons for the popular EngineYard and Heroku Ruby-based hosting platforms.

I recently tried out IndexTank on Heroku. IndexTank itself targets indexing and searching documents, but the Tanker gem does a great job of integrating IndexTank with Rails’ ActiveRecord ORM. IndexTank’s documentation doesn’t mention it, and while not available “out of the box”, IndexTank does support both fuzzy searching and stemming. All you need to do is send an email to their customer support address, and ask to have it turned on.

If your application can benefit from full-text search, there are plenty of interesting choices, regardless of your platform or programming language.