
I am having an issue with the following PostgreSQL query: it takes more than 10 seconds to run. Is there any way to speed this query up to a reasonable speed? I am simply looking for the most relevant search terms associated with videos in a very large database.

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

When the query is run with EXPLAIN ANALYZE, the resulting query plan is as follows (http://explain.depesz.com/s/yDJ):

  Limit  (cost=389625.50..389630.00 rows=1800 width=4) (actual time=11766.693..11770.001 rows=1800 loops=1)
     Output: (count(*)), videoid
     ->  Sort  (cost=389625.50..389692.68 rows=26873 width=4) (actual time=11766.689..11767.818 rows=1800 loops=1)
           Output: (count(*)), videoid
           Sort Key: (count(*))
           Sort Method: top-N heapsort  Memory: 181kB
           ->  HashAggregate  (cost=387769.41..388038.14 rows=26873 width=4) (actual time=9215.653..10641.993 rows=1632578 loops=1)
                 Output: count(*), videoid
                 ->  Bitmap Heap Scan on public.term_search  (cost=44915.83..378163.38 rows=1921207 width=4) (actual time=312.449..7026.036 rows=2047691 loops=1)
                       Output: id, videoid, word, termindex, weight
                       Recheck Cond: (((term_search.word)::text = 'tester'::text) OR ((term_search.word)::text = 'question'::text) OR ((term_search.word)::text = 'one'::text))
                       Rows Removed by Index Recheck: 25512434
                       ->  BitmapOr  (cost=44915.83..44915.83 rows=1950031 width=0) (actual time=288.937..288.937 rows=0 loops=1)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..8552.83 rows=383502 width=0) (actual time=89.266..89.266 rows=419750 loops=1)
                                   Index Cond: ((term_search.word)::text = 'tester'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..13171.84 rows=590836 width=0) (actual time=89.700..89.700 rows=604348 loops=1)
                                   Index Cond: ((term_search.word)::text = 'question'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..21750.26 rows=975693 width=0) (actual time=109.964..109.964 rows=1023593 loops=1)
                                   Index Cond: ((term_search.word)::text = 'one'::text)

The schema for the table is as follows:

    Column   |          Type          |                        Modifiers                         | Storage  | Description 
  -----------+------------------------+----------------------------------------------------------+----------+-------------
   id        | integer                | not null default nextval('term_search_id_seq'::regclass) | plain    | 
   videoid   | integer                |                                                          | plain    | 
   word      | character varying(100) |                                                          | extended | 
   termindex | character varying(15)  |                                                          | extended | 
   weight    | smallint               |                                                          | plain    | 
  Indexes:
      "term_search_pkey" PRIMARY KEY, btree (id)
      "search_term_exists_idx" btree (videoid, word)
      "terms_caverphone_idx" btree (termindex)
      "terms_video_idx" btree (videoid)
      "terms_word_idx" btree (word, videoid)
  Foreign-key constraints:
      "term_search_videoid_fkey" FOREIGN KEY (videoid) REFERENCES videos(id) ON DELETE CASCADE
  Has OIDs: no

I have managed to get it down to 7 seconds with index-only scans, but that is still not fast enough. I am running PostgreSQL 9.3 on Ubuntu 14.04 on an AWS r3.xlarge instance, with approximately 50 million rows in the table. Any advice is greatly appreciated!

EDIT:

Attached is the result of SELECT schemaname,tablename,attname,null_frac,avg_width,n_distinct FROM pg_stats WHERE schemaname='public' and tablename='term_search';

 schemaname |  tablename  |  attname  | null_frac | avg_width | n_distinct 
 ------------+-------------+-----------+-----------+-----------+------------
 public     | term_search | id        |         0 |         4 |         -1
 public     | term_search | videoid   |         0 |         4 |     568632
 public     | term_search | word      |         0 |         6 |       5054
 public     | term_search | termindex |         0 |        11 |       2485
 public     | term_search | weight    |         0 |         2 |          3
  • Any difference if you use `word IN ('tester','question','one')`? – StanislavL May 29 '14 at 06:31
  • I tried IN and ANY instead of OR and there was not much of a difference. Total runtime: 10931.711 ms – user3354369 May 29 '14 at 06:38
  • Can you change the `term_search` table structure? How many distinct values does column `word` have? – klin May 29 '14 at 06:44
  • @user3354369 PostgreSQL transforms `IN` to `= ANY` internally so you won't see any difference there. Please show the plan with index-only scans, and explain how you got that instead. – Craig Ringer May 29 '14 at 07:05
  • Please, add the output of `SELECT schemaname,tablename,attname,null_frac,avg_width,n_distinct FROM pg_stats WHERE schemaname='public' and tablename='term_search';` – vyegorov May 29 '14 at 07:13
  • @klin Yes, I can change the structure to whatever is needed. I am running a query on the distinct values of word; I'll post the result tomorrow morning – user3354369 May 29 '14 at 07:47
  • @CraigRinger When the table was autovacuumed more frequently, the query planner would opt for index-only scans. I run autovacuum with much less aggressive parameters because search performance is much worse while it is running. I'll kick off a manual vacuum and post the result tomorrow morning – user3354369 May 29 '14 at 07:51
  • Well, according to the `EXPLAIN` output, `GROUP BY videoid` returns 1.63 million rows, yet the stats show only 0.57 million distinct values. I would recommend increasing the statistics target for this table and re-analyzing it. – vyegorov May 29 '14 at 08:08
  • @vyegorov That's only out by 2x, which is perfectly reasonable. Remember, it's a *statistical sampling*, an estimation. Additionally, higher stats targets incur higher costs for stats collection and for planning, they aren't free (otherwise they'd be very high by default). You should generally get concerned when stats are out by orders of magnitude, except for the special case of 1 vs not-1. – Craig Ringer May 29 '14 at 11:25

4 Answers


If I had a chance to disconnect users for a night, I would:

  • create a new table with the words from term_search,
  • create a reference to that new table,
  • drop the column word,

something like this:

create table words (
    word_id serial primary key,
    word text);

insert into words (word)
    select distinct word
    from term_search;

alter table term_search add column word_id integer;

update term_search t
    set word_id = w.word_id
    from words w
    where t.word = w.word;

alter table term_search add constraint term_search_word_fkey 
    foreign key (word_id) references words (word_id);
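
The join in the test query below will presumably also need supporting indexes on the new columns; these statements are not part of the original migration and the index names are only illustrative:

-- assumed supporting indexes for the new schema (illustrative names)
create unique index words_word_idx on words (word);
create index term_search_word_id_videoid_idx on term_search (word_id, videoid);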

Test:

SELECT count(*), videoid 
FROM term_search t
JOIN words w on t.word_id = w.word_id
WHERE w.word = 'tester' 
   OR w.word = 'question' 
   OR w.word = 'one' 
GROUP BY videoid 
ORDER BY count(*) desc 
LIMIT 1800;    

-- if it was faster, then:
    alter table term_search drop column word;
-- and, while we are at it...
    alter table term_search alter termindex type text;

After the revolution I'd have to take care of inserts and updates on term_search. I'd probably create a view with rules for insert and update.
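
A rough, untested sketch of what such a view and insert rule could look like (the view name, rule name, and the duplicate-word check are illustrative assumptions, not part of the original answer):

create view term_search_v as
    select t.id, t.videoid, w.word, t.termindex, t.weight
    from term_search t
    join words w on w.word_id = t.word_id;

create rule term_search_v_insert as on insert to term_search_v
do instead (
    -- add the word only if it is not known yet (ignores concurrent inserts)
    insert into words (word)
        select new.word
        where not exists (select 1 from words where word = new.word);
    -- then store the term using the word's id
    insert into term_search (videoid, word_id, termindex, weight)
        select new.videoid, w.word_id, new.termindex, new.weight
        from words w
        where w.word = new.word
);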

klin
    I think this is a good idea: making `term_search` a bridge table between videos and words. Next step would be to make `{video_id, word_id}` the PK (forcing both fields to be NOT NULL, too) ; and dropping the surrogate key `id`. – wildplasser May 29 '14 at 08:50
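
If one went the route wildplasser suggests, the follow-up migration might look roughly like this (a sketch assuming word_id has already been populated and the old word column dropped; it is not from the original comment):

alter table term_search alter column videoid set not null;
alter table term_search alter column word_id set not null;
alter table term_search drop column id;
alter table term_search add primary key (videoid, word_id);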

Let's start by rephrasing the query to explain what it's really trying to do.

The query:

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

seems to mean:

"In a table of search terms, find me videos with the search terms tester, question or one. Count the matches for each video and return the 1800 videos with the most matches".

or, more generally:

"Find me the videos that best match my search terms and show me the top n best matches".

Correct?

If so, why aren't you using PostgreSQL's built-in full-text search and full-text indexing? An indexed tsquery match against a tsvector per video is likely to be a win here. Full-text search has fuzzy matching, ranking, and pretty much everything else you're going to want - and unlike your current approach it won't require the whole data set to be materialized and sorted only to discard most of it.

You haven't provided sample data, so I can't really do a demo.
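
That said, a rough, untested sketch of the full-text approach could look like the following; the tsvector column, the GIN index name, and the 'english' configuration are all assumptions, not something taken from your schema:

-- one tsvector per video, built from the existing search terms
ALTER TABLE videos ADD COLUMN terms_tsv tsvector;

UPDATE videos v
SET terms_tsv = (
    SELECT to_tsvector('english', string_agg(t.word, ' '))
    FROM term_search t
    WHERE t.videoid = v.id
);

CREATE INDEX videos_terms_tsv_idx ON videos USING gin (terms_tsv);

-- ranked search for the same three terms
SELECT v.id, ts_rank(v.terms_tsv, query) AS rank
FROM videos v,
     to_tsquery('english', 'tester | question | one') AS query
WHERE v.terms_tsv @@ query
ORDER BY rank DESC
LIMIT 1800;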


How PostgreSQL currently executes your query could be explained like this:

  • Create a map with one bit for every disk page (8kb) in the table, where true indicates that the page might contain one or more matching rows.

  • For each search term, scan the index terms_word_idx and update the bitmap to set the bit where a match is found

  • Scan the table, skipping over pages where the bitmap says there can't be any matches, looking for rows that have any of the words. This is like a fast, skip-over-blanks seqscan. It's actually not tons faster than a plain seqscan if the percentage of matches is high.

  • For each matching row, sort it into a series of "buckets" based on the video id. Then at the end, count how many rows are in each bucket and return the count + the video ID for that bucket. (It's not that simple, but close enough).

  • As you count each bucket, put the result in between the results with next-highest and next-lowest counts.

  • Take the top 1800 results, and throw away all the rest of your hard work.

That doesn't sound like much fun, but it doesn't have any choice. A b-tree index can't be descended to search simultaneously for multiple terms, so it has to do multiple index scans. The rest kind of follows from that.

So: to make this more efficient, you need to fundamentally change how you tackle the problem. Adding an index or tuning some parameters isn't going to suddenly make this take 0.5s.

Craig Ringer

You can tune PostgreSQL settings to decrease query execution time. For example, you can use the pgtune utility:

apt-get install pgtune
cd /etc/postgresql/*.*/main/
cp postgresql.conf postgresql.conf.default
pgtune -i postgresql.conf.default -o postgresql.conf --type=%TYPE%

Here %TYPE% is one of the following values:

  • DATA for large data sets, big queries, and low query frequency
  • WEB for web applications; best for Django applications and other web apps

You can find more information about pgtune on Google and in its help output.

For PostgreSQL versions below 9.3 you must also use this script to raise the kernel shared memory limits:

#!/bin/bash
# simple shmsetup script
page_size=`getconf PAGE_SIZE`
phys_pages=`getconf _PHYS_PAGES`
shmall=`expr $phys_pages / 2`
shmmax=`expr $shmall \* $page_size`
echo kernel.shmmax = $shmmax
echo kernel.shmall = $shmall

Write the resulting lines into the file /etc/sysctl.conf and reboot the system; otherwise Postgres can't start.

Dunaevsky Maxim
    It's also a pretty rough tool at best, and is far from a magic "make my database faster". When confronting a specific query you're better off looking at the query's plan, stats, etc. – Craig Ringer May 29 '14 at 11:27

Others have provided some advice on how to refactor the database, but you can probably make the query run better as it currently is. The following line from the EXPLAIN output suggests that your bitmaps are overflowing work_mem and becoming lossy:

Rows Removed by Index Recheck: 25512434

If that recheck is what is consuming the time (as opposed to I/O; running EXPLAIN (ANALYZE, BUFFERS) would help clarify that, especially if track_io_timing is on), then increasing work_mem could help a lot, assuming you can afford to do so without running out of RAM.
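
One way to test that is to bump work_mem for a single session and re-run the plan; the 256MB figure below is only an illustrative value, not a recommendation:

-- raise work_mem for this session only, then re-check the plan
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*), videoid
FROM term_search
WHERE word IN ('tester', 'question', 'one')
GROUP BY videoid
ORDER BY count(*) DESC
LIMIT 1800;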

jjanes