Getting Started with Django and PostgreSQL Full-Text Search
After hearing a few months back that PostgreSQL had built-in full-text search support, I'd been looking for an excuse to take advantage of it. A few months went by, and when one of our clients at Lionheart started running into slowness with the Django admin's built-in search support, I knew I had a solid candidate. After all was said and done, searches returned results approximately 20x faster than vanilla Django. Here's how you can do it too.
The Basics
One of the great things about PostgreSQL full-text search is that there isn't too much to learn if you're already pretty familiar with search engines and how they work. The main idea is that you generate an index using a document containing the terms you'd like to search on. PostgreSQL provides a lot of flexibility with regards to how you choose to do this, but I opted to store the documents in a field called fts_document on the table we wanted to search on.
PostgreSQL stores search data in a data type called a tsvector. Here's how the PostgreSQL documentation describes it:
A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Sorting and duplicate-elimination are done automatically during input.
Here's an example of a sentence converted to a tsvector:
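(The sample sentence here is my own.)

```sql
SELECT to_tsvector('simple', 'the quick brown fox jumped over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jumped':5 'lazy':8 'over':6 'quick':2 'the':1,7
```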
When you want to search for something, you create what's called a tsquery.
A tsquery value stores lexemes that are to be searched for, and combines them honoring the Boolean operators & (AND), | (OR), and ! (NOT). Parentheses can be used to enforce grouping of the operators.
Here's an example:
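(The query terms here are my own.)

```sql
SELECT to_tsquery('simple', 'fox & (quick | brown)');
-- 'fox' & ( 'quick' | 'brown' )
```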
When you want to test for a match, you use the @@ operator.
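For instance, continuing with the same sample sentence:

```sql
SELECT to_tsvector('simple', 'the quick brown fox') @@ to_tsquery('simple', 'quick & fox');
-- t
```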
It's a match!
The @@ operator also provides a shorthand whereby you don't need to use the to_tsvector / to_tsquery functions explicitly. E.g.,
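A sketch (note that this shorthand uses your database's default text search configuration rather than 'simple'):

```sql
SELECT 'the quick brown fox' @@ 'quick fox';
-- t
```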
Sweet! However, you might be confused by the following:
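For instance (same caveat about the default configuration):

```sql
SELECT 'the quick brown fox' @@ 'fox | random';
-- f
```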
Huh? We're searching for either the term "fox" or "random". This doesn't seem right…
Well, it turns out that when you specify plain text as an argument to the @@ operator, it wraps it in plainto_tsquery, which functions a little bit differently than the to_tsquery function that we've been playing around with above. Notably, it ignores every logical operator except &. So |'s, parentheses, etc., will all be ignored. Personally, I don't find it to be all that useful.
Also, as you might have noticed above, I provided an additional argument to both the to_tsvector and to_tsquery functions: 'simple'. This represents the simple text search configuration. Other text search configurations are language-based and strip out stopwords and normalize lexemes based on the language you're indexing. Here are a few other options that I have available on my machine.
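You can list the configurations available to you by querying the pg_ts_config catalog (the exact list depends on your PostgreSQL version and build):

```sql
SELECT cfgname FROM pg_ts_config;
-- simple, danish, dutch, english, finnish, french, german, ...
```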
Play around with a few and see which one suits you best.
Django
By now you know how to create search vectors and queries, search on some data, and specify a text search configuration. Let's get started on connecting these tools with your Django project.
Since tsvector and tsquery are both standard PostgreSQL data types, we can use them just as we might any other data type, like int or varchar, in a column definition. This is very handy. Otherwise, we would have to regenerate vectors on-the-fly, which might offset any speedup we get from using FTS in the first place. :)
For the sake of moving forward, let's say we have a model called Employee in our Django project. We're trying to speed up searches on anything related to an employee. Assume we're storing their first name, last name, and full address.
class Employee(models.Model):
    first_name = models.CharField(max_length=64)
    last_name = models.CharField(max_length=64)
    address_1 = models.TextField()
    address_2 = models.TextField()
    city = models.CharField(max_length=64)
    state = models.CharField(max_length=2)
    zip = models.CharField(max_length=9)
We're now going to create a tsvector based on these fields. To do this, we'll concatenate all of the fields into a list and then flatten the list into a string to feed into the to_tsvector function.
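In SQL, that might look something like this (using concat_ws to flatten the fields into a single string):

```sql
SELECT to_tsvector('simple',
    concat_ws(' ', first_name, last_name, address_1, address_2, city, state, zip))
FROM employee;
```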
If we wanted to stop here, we could just do FTS matches on the fly, but this would be quite slow.
The next step is to store this tsvector in a field and keep it updated when the data changes. The first step seems easy enough:
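Adding the column is a one-liner:

```sql
ALTER TABLE employee ADD COLUMN fts_document tsvector;
```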
Hmm, but how do we keep this updated? There are a bunch of options, but according to the PostgreSQL documentation, the best one is to add a trigger that fires on row updates and inserts. To make things easy, we're going to create a function to generate the FTS document. E.g.,
CREATE FUNCTION employee_fts_document(e employee) RETURNS tsvector AS $$
DECLARE
    employee_document TEXT;
BEGIN
    SELECT concat_ws(' ', e.first_name, e.last_name, e.address_1, e.address_2, e.city, e.state, e.zip) INTO employee_document;
    RETURN to_tsvector('pg_catalog.simple', employee_document);
END;
$$ LANGUAGE plpgsql;
We can then reference this function in the trigger.
CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
BEGIN
    NEW.fts_document = employee_fts_document(NEW);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
Finally, we tell PostgreSQL to execute the trigger on employee updates and inserts.
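That looks like this:

```sql
CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee
    FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();

CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee
    FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger();
```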
We're almost done. The last thing we might want to do is tell PostgreSQL to index the fts_document column, which contains our full-text search vectors. We're going to add a GIN (Generalized Inverted Index) index to our column. You can read about GIN-based and GiST-based indices in the PostgreSQL documentation. Here's how we add our index.
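Like so:

```sql
CREATE INDEX employee_fts_index ON employee USING gin(fts_document);
```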
That's it. Our employee table now has an always up-to-date search index that we can use. Before we continue, let's instantiate the search vectors. We'll then test everything out to make sure things are working a-ok.
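A sketch of backfilling the column and trying a search (the name searched for is just an example):

```sql
UPDATE employee SET fts_document = employee_fts_document(employee);

SELECT first_name, last_name FROM employee
WHERE fts_document @@ to_tsquery('simple', 'swanson');
```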
Success!
South
Now that we understand the internals, wouldn't it be nice if we could keep our databases up-to-date without having to manually copy and paste commands into our PostgreSQL shell? South to the rescue!
We'll need to generate a blank migration.
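With South, that's something like the following (the app and migration names here are placeholders):

```shell
python manage.py schemamigration employees employee_fts --empty
```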
Next, open up the migration, and copy the SQL above into the forwards and backwards methods, as illustrated below.
# ... imports, etc ...
from django.db import connection

class Migration(SchemaMigration):

    def forwards(self, orm):
        sql = """
        CREATE FUNCTION employee_fts_document(e employee) RETURNS tsvector AS $$
        DECLARE
            employee_document TEXT;
        BEGIN
            SELECT concat_ws(' ', e.first_name, e.last_name, e.address_1, e.address_2, e.city, e.state, e.zip) INTO employee_document;
            RETURN to_tsvector('pg_catalog.simple', employee_document);
        END;
        $$ LANGUAGE plpgsql;

        CREATE FUNCTION employee_fts_document_trigger() RETURNS TRIGGER AS $$
        BEGIN
            NEW.fts_document = employee_fts_document(NEW);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;
        """
        cursor = connection.cursor()
        cursor.execute(sql)
        cursor.execute("ALTER TABLE employee ADD COLUMN fts_document tsvector")
        cursor.execute("UPDATE employee SET fts_document=employee_fts_document(employee)")
        cursor.execute("CREATE TRIGGER employee_fts_update_trigger BEFORE UPDATE ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE TRIGGER employee_fts_insert_trigger BEFORE INSERT ON employee FOR EACH ROW EXECUTE PROCEDURE employee_fts_document_trigger()")
        cursor.execute("CREATE INDEX employee_fts_index ON employee USING gin(fts_document)")

    def backwards(self, orm):
        cursor = connection.cursor()
        cursor.execute("DROP INDEX employee_fts_index")
        cursor.execute("ALTER TABLE employee DROP COLUMN fts_document")
        cursor.execute("DROP TRIGGER employee_fts_update_trigger ON employee")
        cursor.execute("DROP TRIGGER employee_fts_insert_trigger ON employee")
        cursor.execute("DROP FUNCTION employee_fts_document (employee)")
        cursor.execute("DROP FUNCTION employee_fts_document_trigger ()")

    models = {
        # App-specific model definitions here...
    }
Now, just run this migration…
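Something like (the app name is a placeholder):

```shell
python manage.py migrate employees
```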
…and watch as your FTS column, indices, and triggers are automagically created.
The Django Admin
OK, so we have our FTS document created, have generated a migration to keep things in sync across machines, and are indexing everything we might possibly want to search on to find an employee. The next step is to use what we have and integrate it into the Django Admin. This assumes that you have an EmployeeAdmin object similar to the below:
Our goal is to make searching with the Django Admin use our shiny new FTS index. To do this, we're going to need to override the get_search_results method in EmployeeAdmin with something like this:
class EmployeeAdmin(admin.ModelAdmin):
    # ...
    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self) \
                .get_search_results(request, queryset, search_term)
        else:
            queryset, distinct = super(EmployeeAdmin, self) \
                .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )
        return queryset, distinct
Looks close, but we're missing something. to_tsquery expects a "properly-formed" query, and we can't always trust or depend on users of the admin to know PostgreSQL search syntax (don't blame them). We need to sanitize the input before feeding it into PostgreSQL.
Here are some examples of things that we might want to handle:
- Strip padded whitespace.
- Remove duplicated whitespace.
- Add &'s between words that are only separated by spaces.
- Replace double quotes (which PostgreSQL doesn't like) with single quotes.
- Clear up situations where search operators are used ambiguously (e.g., "ron & | swanson").
- Add a prefix wildcard to every search term, so that a search for "swan" will return "Ron Swanson" in the search results.
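To give a feel for the last two rules, here's a stripped-down sketch that only handles whitespace, AND-joining, and prefix wildcards (the full function below does much more):

```python
def naive_sanitize(term):
    # Stripped-down sketch: trim and collapse whitespace, AND the
    # words together, and add a prefix wildcard to each term.
    words = term.strip().split()
    return " & ".join(word + ":*" for word in words)

print(naive_sanitize("  ron   swanson "))  # ron:* & swanson:*
```

With this, a search for "swan" becomes "swan:*", which matches "swanson" in the index.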
Luckily for you, I've already done all of the heavy lifting. Below is a function called sanitize_search_term which will take a string and form it into a proper PostgreSQL search query.
import re
import string

def sanitize_search_term(term):
    # Replace all punctuation with spaces.
    allowed_punctuation = set(['&', '|', '"', "'"])
    all_punctuation = set(string.punctuation)
    punctuation = "".join(all_punctuation - allowed_punctuation)
    term = re.sub(r"[{}]+".format(re.escape(punctuation)), " ", term)

    # Substitute all double quotes to single quotes.
    term = term.replace('"', "'")
    term = re.sub(r"[']+", "'", term)

    # Create regex to find strings within quotes.
    quoted_strings_re = re.compile(r"('[^']*')")
    space_between_words_re = re.compile(r'([^ &|])[ ]+([^ &|])')
    spaces_surrounding_letter_re = re.compile(r'[ ]+([^ &|])[ ]+')
    multiple_operator_re = re.compile(r"[ &]+(&|\|)[ &]+")

    tokens = quoted_strings_re.split(term)
    processed_tokens = []
    for token in tokens:
        # Remove all surrounding whitespace.
        token = token.strip()
        if token in ['', "'"]:
            continue
        if token[0] != "'":
            # Surround single letters with &'s.
            token = spaces_surrounding_letter_re.sub(r' & \1 & ', token)
            # Specify '&' between words that have neither | nor & specified.
            token = space_between_words_re.sub(r'\1 & \2', token)
            # Add a prefix wildcard to every search term.
            token = re.sub(r'([^ &|]+)', r'\1:*', token)
        processed_tokens.append(token)
    term = " & ".join(processed_tokens)

    # Replace ampersands or pipes surrounded by ampersands.
    term = multiple_operator_re.sub(r" \1 ", term)

    # Escape single quotes.
    return term.replace("'", "''")
Combining everything together, we get the following:
class EmployeeAdmin(admin.ModelAdmin):
    # ...
    def get_search_results(self, request, queryset, search_term):
        if search_term == '':
            queryset, distinct = super(EmployeeAdmin, self) \
                .get_search_results(request, queryset, search_term)
        else:
            search_term = sanitize_search_term(search_term)
            queryset, distinct = super(EmployeeAdmin, self) \
                .get_search_results(request, queryset, None)
            queryset = queryset.extra(
                where=["employee.fts_document @@ to_tsquery('simple', %s)"],
                params=[search_term]
            )
        return queryset, distinct
And…we're done! Searches in the admin will now use PostgreSQL FTS.
Here's to speedy searching!
EDIT 12/15/2014: In a previous version of this post, employee_fts_document would pull stale values from the DB while generating the FTS document and performed an extraneous SELECT. Thanks, Roger!