Friday, August 21, 2015

similarity() and trigram indexes

Continuing our adventures with watchlists...

Lets say we create a trigram index on our watchlist table:

CREATE INDEX watchlist_entry_gin_trigram_index
  ON watch_list_entry
  USING gin
  (entry COLLATE pg_catalog."default" gin_trgm_ops);

And now we want to use it.  This query will NOT use the index:

select *
from watch_list_entry
where similarity(entry, 'Tony Blair') > 0.5;

(221 secs)

But this pair of queries (which is functionally the same) WILL use the index:

select set_limit(0.5);

select *
from watch_list_entry
where similarity(entry, 'Tony Blair') > 0.5;

(34 secs)


Wednesday, August 19, 2015

the power of selective indexes

Today I was working for a client who was having a bit of a performance issue.  He has a table that looks like this:


Id Source Type Entry ParentID
1 FBI Name Charles Smith NULL
2 FBI Alias Chuck Smith 1
3 FBI Eye Color Brown 1
4 Interpol Name Richard Jones NULL
4Interpol Alias Dick Jones 4
6 Interpol Nationality UK 4


And so on, for about 12MM rows. (These are "wanted" lists -- and we get different attributes on a per-list or per-entry basis.) So what we want to do is let someone search by name, in which case we want to match any name or alias. We have a lot of lists, so we end up parallelizing the query like this:

select *
from watch_list
where source = 'FBI'
and similarity(name, 'Robert Cooper') > 0.5


But we were not getting the performance that we needed out of it that we needed. Enter "selective indexes", where we can limit the number of rows in the index, thus limiting the number of rows we have to search:

create index name_and_alias on watch_list (entry) where type in ('Name','Alias');


And boom!, we get the performance we need.