PostgreSQL Text Search


Introduction

A common problem in software development is searching through text documents. For example, if you have a database of recipes, you might want to search by one or more ingredients, or if you have a collection of server log files, you might want to find all errors that did not come from the database. This type of functionality is called “text search”. There are libraries such as Lucene, and applications such as Elasticsearch (which is built on top of Lucene), that are built specifically for text searching. Before automatically adding a new library or application to your project, take a look at PostgreSQL's text search functionality to see if it meets your requirements.

Basics of Text Search

A text search engine typically works by parsing a document into its base components, known as lexemes. A lexeme is usually the root form of a word, to which other common forms of the word are normalized. For example, if a document contains “write”, “writes”, “writing”, or “wrote”, the text search engine may store all of these as the lexeme “write”. A search for any form of the lexeme will then be a match. In addition, the parsing process ignores common words that are typically not relevant search terms, such as “the”, “of”, “a”, and “and”.
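In PostgreSQL this normalization is easy to observe with the to_tsvector function, covered below. The following sketch assumes the default “english” configuration; the exact lexemes produced can vary by configuration and dictionary:

```sql
-- Both word forms are reduced to the lexeme 'write'
=> select to_tsvector('english', 'writes writing');
 to_tsvector
-------------
 'write':1,2
```

Note that irregular forms are not always normalized; with the default dictionaries, “wrote” is kept as its own lexeme.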

Overview

The three basic components of text searching in PostgreSQL are “tsvector”, “tsquery”, and the match operator (“@@”). A tsvector is a text document that has been parsed into lexemes. Similarly, a tsquery is one or more search terms parsed into lexemes and combined with operators. The @@ text search match operator determines whether a tsvector matches a tsquery.

Basic Example

=> select 'the man was cooking his dinner'::tsvector @@ 'dinner'::tsquery;
?column?
----------
t

 

tsvector

tsvector is a PostgreSQL data type. As the example above shows, text can be cast to a tsvector, but the better way to convert text to a tsvector is the to_tsvector function. The example tsvector should instead be written:

to_tsvector('the man was cooking his dinner')

To explain the difference:

Direct cast

=> select 'the man was cooking his dinner'::tsvector;
tsvector
--------------------------------------------
'cooking' 'dinner' 'his' 'man' 'the' 'was'

Function call

=> select to_tsvector('the man was cooking his dinner');
to_tsvector
-----------------------------
'cook':4 'dinner':6 'man':2

The function discards any words that are not interesting search terms, uses the lexeme “cook” for “cooking”, and records each lexeme's position in the original text (the numbers after the colons), which the followed-by operator uses.
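This difference matters when matching. A minimal comparison, assuming the default “english” configuration:

```sql
-- The cast stores 'cooking' verbatim, so the stemmed query lexeme 'cook' finds no match
=> select 'the man was cooking his dinner'::tsvector @@ to_tsquery('cooking');
 ?column?
----------
 f

-- The function stores the lexeme 'cook', which the same query matches
=> select to_tsvector('the man was cooking his dinner') @@ to_tsquery('cooking');
 ?column?
----------
 t
```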

 

tsquery

As the name implies, a tsquery is a text search query. As with tsvector, text can be cast directly to a tsquery, but it is better to use a function for this purpose. The main function for parsing text into a tsquery is to_tsquery.
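Like to_tsvector, to_tsquery normalizes each search term to its lexeme, so a query matches all forms of a word. For example, with the default “english” configuration:

```sql
=> select to_tsquery('cooking & dinners');
     to_tsquery
-------------------
 'cook' & 'dinner'
```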

There are a few operators for combining search terms:

And

The and operator is &

to_tsquery('man & dinner')

Or

The or operator is |

to_tsquery('lunch | dinner')

Not

The not operator is !

to_tsquery('lunch & !dinner')

Followed by

The followed by operator is <N>, and specifies that the second term occurs
exactly N words after the first term. In the following example, the & query returns
true because both terms exist anywhere in the search document. The <2>
query returns false because “dinner” is not exactly two words after “man”.

=> select to_tsvector('the man was cooking his dinner') @@ to_tsquery('man & dinner');
?column?
----------
t
=> select to_tsvector('the man was cooking his dinner') @@ to_tsquery('man <2> dinner');
?column?
----------
f

In many examples, you will see term1 <-> term2. This is the same as
<1>, meaning term2 must immediately follow term1.
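Using the positions recorded in the tsvector ('cook':4, 'dinner':6), the distance operators can be illustrated as follows:

```sql
-- 'dinner' is exactly two positions after 'cook', so <2> matches
=> select to_tsvector('the man was cooking his dinner') @@ to_tsquery('cooking <2> dinner');
 ?column?
----------
 t

-- <-> requires 'dinner' to immediately follow 'cooking', so this does not match
=> select to_tsvector('the man was cooking his dinner') @@ to_tsquery('cooking <-> dinner');
 ?column?
----------
 f
```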
 

Text Search Configuration

The text search parsing functions (to_tsquery and to_tsvector) each take an
optional configuration argument. Among other things, the parser uses this
configuration to decide which common words to ignore and how to find the root
lexeme of a word. The system default configuration is used if none is provided.

The following query returns the system default:

=> select get_current_ts_config();
get_current_ts_config
-----------------------
english

So, in the examples above, to_tsquery('man & dinner') is the same as
to_tsquery('english', 'man & dinner').
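The choice of configuration changes the parsing. For example, the “simple” configuration lowercases words but does not stem them or remove common words:

```sql
=> select to_tsvector('simple', 'the man was cooking his dinner');
                       to_tsvector
---------------------------------------------------------
 'cooking':4 'dinner':6 'his':5 'man':2 'the':1 'was':3
```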

Execute the following query to list all of the available text search configurations:

=> select * from pg_catalog.pg_ts_config;
   oid |  cfgname   | cfgnamespace | cfgowner | cfgparser 
-------+------------+--------------+----------+-----------
  3748 | simple     |           11 |       10 |      3722
 13157 | arabic     |           11 |       10 |      3722
 13159 | armenian   |           11 |       10 |      3722
...

 

Storage and Indexing

To use PostgreSQL text search, there are a few different options for storing and indexing text documents, each
with its own advantages and disadvantages. In each case, the
database will need an index for anything beyond trivial examples. For most use
cases, that index should be a GIN (“Generalized Inverted Index”).

Note: for each of the examples below, the scenario is providing search capability for a collection of recipes. Each recipe is its own document, and has not been broken down into ingredient lists, directions, timings, etc.

To simplify the example SQL statements, the psql variable “the_recipe” is set to the following for all examples:

\set the_recipe 'Chicken Pot Pie\n\nIngredients\n* 2 cups chicken diced\n* 1 cup peas\n* 1 cup carrots\n...';

 

Storing tsvector Only

If the document is stored outside of the database, only the processed tsvector
needs to be stored in the database. For example, suppose we already have a website
containing all of the recipes, and we now want to add search capability. This
approach requires some mechanism or procedure to keep the database tsvector in sync
with the externally stored document. In this case, the model could look something like the
following:

-- Create the table
CREATE TABLE recipes (
    id INT PRIMARY KEY NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    url TEXT NOT NULL,
    tsv TSVECTOR NOT NULL);

-- Create an index on the tsvector
CREATE INDEX recipe_tsv_idx ON recipes USING GIN (tsv);

-- Insert a row
INSERT INTO recipes(url, tsv)
VALUES ('http://example.com/chicken-pot-pie', to_tsvector(:'the_recipe'));

-- Sample query
=> SELECT url FROM recipes WHERE tsv @@ to_tsquery('chicken & peas');
url
------------------------------------
http://example.com/chicken-pot-pie

 

Storing Text Document Only

It is also possible to store just the text document, and not the parsed tsvector. In this case, an expression index (also called a functional index) can be used to create the text search index. Note that the index expression must use the two-argument form of to_tsvector, with an explicit configuration, and queries must use the same form; otherwise the planner cannot use the index. Here is a basic example:

-- Create the table
CREATE TABLE recipes (
    id INT PRIMARY KEY NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    title TEXT NOT NULL,
    document TEXT NOT NULL);

-- Create the index
CREATE INDEX recipe_idx ON recipes USING GIN (to_tsvector('english',
document));

-- Insert a row
INSERT INTO recipes(title, document)
VALUES ('Chicken Pot Pie', :'the_recipe');

-- Sample query
=> SELECT title FROM recipes
WHERE to_tsvector('english', document) @@ to_tsquery('chicken & peas');
title
-----------------
Chicken Pot Pie

 

Storing Text Document With Generated tsvector Column

A third option is to store the text document and auto-generate a tsvector column from the document. This allows a regular GIN index on the generated column, and queries can reference that column directly instead of calling to_tsvector in every select statement.

-- Create the table
CREATE TABLE recipes (
    id INT PRIMARY KEY NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    title TEXT NOT NULL,
    document TEXT NOT NULL);

-- Add the tsvector column
ALTER TABLE recipes
ADD COLUMN tsv TSVECTOR
GENERATED ALWAYS AS (to_tsvector('english', document)) STORED;

-- Create an index on the tsvector
CREATE INDEX recipe_tsv_idx ON recipes USING GIN (tsv);

-- Insert a row
INSERT INTO recipes(title, document)
VALUES ('Chicken Pot Pie', :'the_recipe');

-- Sample query
=> SELECT title FROM recipes WHERE tsv @@ to_tsquery('chicken & peas');
title
-----------------
Chicken Pot Pie

 

Conclusion

PostgreSQL text search has a lot of flexibility and power, and it does not require any extra setup or configuration to get started. So, if Postgres is already part of your tech stack, it is worth investigating whether its text search meets your requirements before bringing in a new library or application.