
I have a table of locations (approximately 29 million rows):

Table "public.locations"
Column   |   Type| Modifiers  
------------------------------------+-------------------+------------------------------------------------------------
id | integer   | not null default nextval('locations_id_seq'::regclass)
dl | text  | 
Indexes:
"locations_pkey" PRIMARY KEY, btree (id)
"locations_test_idx" gin (to_tsvector('english'::regconfig, dl))

I want the following query to perform well:

EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM locations WHERE to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps') LIMIT 10;

But the query plan produced shows a sequential scan being used:

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..65.18 rows=10 width=4) (actual time=62217.569..62217.569 rows=0 loops=1)
   Buffers: shared hit=262 read=447808
   I/O Timings: read=861.370
   ->  Seq Scan on locations  (cost=0.00..967615.99 rows=148442 width=2) (actual time=62217.567..62217.567 rows=0 loops=1)
         Filter: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
         Rows Removed by Filter: 29688342
         Buffers: shared hit=262 read=447808
         I/O Timings: read=861.370
 Planning time: 0.109 ms
 Execution time: 62217.584 ms

After forcibly turning off sequential scans

set enable_seqscan to off;

the query plan now uses the GIN index:

                                                                  QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1382.43..1403.20 rows=10 width=2) (actual time=0.043..0.043 rows=0 loops=1)
   Buffers: shared hit=1 read=3
   ->  Bitmap Heap Scan on locations  (cost=1382.43..309697.73 rows=148442 width=2) (actual time=0.043..0.043 rows=0 loops=1)
         Recheck Cond: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
         Buffers: shared hit=1 read=3
         ->  Bitmap Index Scan on locations_test_idx  (cost=0.00..1345.32 rows=148442 width=0) (actual time=0.041..0.041 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
               Buffers: shared hit=1 read=3
 Planning time: 0.089 ms
 Execution time: 0.069 ms
(10 rows)

The cost settings are pasted below:

select name,setting from pg_settings where name like '%cost';                       
         name         | setting 
----------------------+---------
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 random_page_cost     | 4
 seq_page_cost        | 1
(5 rows)

I'm looking for a way to make the above query avoid the sequential scan without resorting to tricks like turning enable_seqscan off.

I tried raising seq_page_cost to 20, but the query plan remained the same.
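
For reference, that attempt looked roughly like this (a sketch of the session, using the same query as above):

SET seq_page_cost = 20;

EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM locations WHERE to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps') LIMIT 10;
-- still produces the sequential-scan plan shown above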

user2512324
  • As a first attempt, run `ANALYZE` to update statistics. If that doesn't work, increase `default_statistics_target` and retry. – Laurenz Albe Sep 25 '17 at 13:57
  • I ran `VACUUM FULL ANALYZE locations` but it didn't help. Also `SET default_statistics_target to 10000;` didn't help either. – user2512324 Sep 25 '17 at 14:13
  • Run `ANALYZE` (not `VACUUM FULL`) after changing `default_statistics_target`. Still no improvement? – Laurenz Albe Sep 25 '17 at 14:32
  • Actually I ran `VACUUM FULL ANALYZE locations` before setting `default_statistics_target`. Setting `default_statistics_target` and then running `VACUUM FULL ANALYZE` again changed the query plan to use the GIN index (the working sequence is sketched below). I guess `ANALYZE` is the key here. Thanks very much for the suggestion. – user2512324 Sep 25 '17 at 14:45
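
For reference, the sequence that worked, per the comments above, looks roughly like this (a sketch; 10000 is the target value mentioned in the comments, and per the answer below a plain ANALYZE should do in place of VACUUM FULL ANALYZE):

SET default_statistics_target TO 10000;
ANALYZE locations;  -- recollect statistics at the higher target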

1 Answer


The problem here is that PostgreSQL's statistics suggest that enough rows satisfy the condition that it estimates it will be cheaper to fetch rows sequentially until it has 10 that match.

But not a single row satisfies the condition, so the query ends up scanning the whole table, while an index scan would have found that out much faster.
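
If you want to see the estimate the planner is working from, the statistics that ANALYZE collects for the index expression are visible in pg_stats under the index's name (a quick sketch; for a tsvector expression the per-lexeme data lands in most_common_elems):

SELECT tablename, attname, n_distinct, null_frac, most_common_elems
FROM pg_stats
WHERE tablename = 'locations_test_idx';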

You can improve the quality of the statistics collected for the indexed expression like this:

ALTER TABLE locations_test_idx
   ALTER COLUMN to_tsvector SET STATISTICS 10000;

Then run ANALYZE, and PostgreSQL will collect better statistics for that expression, hopefully improving the query plan.
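
A minimal sketch of that step, assuming the ALTER statement above has already been run:

ANALYZE locations;

-- Re-check the plan; with accurate statistics the row estimate should drop
-- and the bitmap index scan should win on cost.
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM locations WHERE to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps') LIMIT 10;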

Laurenz Albe