1

I am working on a search system for a database of git commits. I am currently using full text search to enable a user to search by author, commit date, the log message, and commit hash. Currently, the commit hash is only useful if the user provides the entire commit hash, which is long and hard to remember, but useful for specifying a single commit.

The query for querying the database is essentially this:

SELECT
    cid,
    (ts_rank(tsv, q) + ts_rank_cd(tsv, q)) AS rank
FROM
    search,
    plainto_tsquery(%(query)s) AS q
WHERE
    (tsv @@ q);

where cid is the commit hash and the tsv is the text search vector of relevant information for each commit.

My goal is to allow users to only provide a portion of the commit hash in their query, and provide all commits that basically follow from their input.

I've looked into trigrams, which look the most promising, but I'm not entirely sure how to integrate them into this query.

Evan
  • 508
  • 1
  • 5
  • 18
  • While not simply `cid LIKE '1234...%'` ? The short form of a commit-id is always left-anchored, isn't it? – Daniel Vérité Jun 16 '17 at 13:31
  • I only have one search bar for all types of text, so I can't necessarily even determine if they have entered a cid. While that seems like a reasonable assumption, I haven't verified that formally. – Evan Jun 16 '17 at 14:01
  • Not really an answer but FWIW I don't see how to implement this search, including with trigrams, without first trying to recognize a potential short commit-id from the user input. – Daniel Vérité Jun 19 '17 at 17:48
  • 1
    The way I was thinking of doing it was to simply compute the tri-gram for each commit hash, and add that list to the search vector, but I can't figure out how to get the results of show_trgm into the search vector. Another way would be to look through the query and find anything that "looks" like a hash and try that against all hashes. – Evan Jun 21 '17 at 13:32
  • It looks like I can use array_to_tsvector(show_trgm(commit_hash)) along with everything else. I'll give this a try. – Evan Jun 21 '17 at 13:35
  • That didn't work how I had anticipated it working. – Evan Jun 22 '17 at 19:52

1 Answers1

2

1: Create columns/a view/materialized view of the tsvectors.

CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat(
'SELECT to_tsvector('simple', post.title) || 
    to_tsvector('simple', post.content) ||
    to_tsvector('simple', author.name) ||
    to_tsvector('simple', coalesce(string_agg(tag.name, ' ')))
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id');

2: select from this column using trigram

SELECT word
FROM unique_lexeme
WHERE similarity(word, 'samething') > 0.5 
ORDER BY word <-> 'samething';

(Search in this site: Mispelling http://rachbelaid.com/postgres-full-text-search-is-good-enough/)

3: when you found the words, use them to rank the results. With subquery:

SELECT word WHERE similarity(word, 'samething') > 0.5 ORDER BY word <-> 'samething';

Alternatively you could just create a subquery where you check for similarity.

Additions:

Index the tsvector columns.

Refresh the materialized view concurrently (http://www.postgresqltutorial.com/postgresql-materialized-views/).

Use triggers to update columns (https://www.postgresql.org/docs/9.0/textsearch-features.html)