
I have a big table with more than 500 million rows. I'm trying to find the best indexing alternative to speed up the query time a bit. I suspect that sorting by timestamp slows the query down a lot. The table has 15 columns in it.

My table has a @ManyToOne relation to other_table. Users can define a max result as well. The code looks like this:

// I'm showing the JPQL itself here instead of the name of the @NamedQuery inside the entity class.
TypedQuery<MyTable> query = em.createQuery(
        "SELECT m FROM MyTable m WHERE m.other_table.id = :id AND m.city IN :cities ORDER BY m.timestamp DESC",
        MyTable.class);
query.setParameter("id", id);
query.setParameter("cities", cities);
query.setMaxResults(number);
return query.getResultList();

What is the best alternative for this type of query? A composite index? Which index type is most suitable in this case?

We already have the following index, but as I said, the query still takes a long time with it:

CREATE INDEX my_table_idx ON my_schema.my_table USING btree (other_table_id, timestamp DESC NULLS LAST, city)

EDIT 1:

This is the execution plan:

Limit  (cost=2876886.98..2876887.03 rows=20 width=162) (actual time=101820.279..101820.284 rows=20 loops=1)
  Buffers: shared hit=8063 read=635649 written=12198
  ->  Sort  (cost=2876886.98..2879114.34 rows=890941 width=162) (actual time=101820.277..101820.278 rows=20 loops=1)
        Sort Key: timestamp DESC
        Sort Method: top-N heapsort  Memory: 35kB
        Buffers: shared hit=8063 read=635649 written=12198
        ->  Bitmap Heap Scan on my_table  (cost=31640.64..2853179.36 rows=890941 width=162) (actual time=199.824..101221.260 rows=711774 loops=1)
              Recheck Cond: ((m_other_table_id = '14b713d5-fb1a-4dbd-c013-fat4a7f6c8e3'::uuid) AND (m_city_id = 3))
              Rows Removed by Index Recheck: 28920837
              Heap Blocks: exact=23535 lossy=615808
              Buffers: shared hit=8060 read=635649 written=12198
              ->  Bitmap Index Scan on my_table_idx  (cost=0.00..31417.90 rows=890941 width=0) (actual time=189.011..189.012 rows=711777 loops=1)
                    Index Cond: ((m_other_table_id = '14b713d5-fb1a-4dbd-c013-fat4a7f6c8e3'::uuid) AND (m_city_id = 3))
                    Buffers: shared hit=90 read=4276
Planning time: 0.198 ms
Execution time: 101821.109 ms

and these are the indexes we have:

CREATE INDEX my_table_idx ON my_schema.my_table USING btree (other_table_id, timestamp DESC NULLS LAST, city)
CREATE UNIQUE INDEX my_table_prev_id_idx ON my_schema.my_table USING btree (m_prev_id)
CREATE INDEX my_table_other_table_fk_idx ON my_schema.my_table USING btree (m_other_table_id)
CREATE UNIQUE INDEX my_table_pkey ON my_schema.my_table USING btree (m_id)
CREATE INDEX my_table_track_fk_idx ON my_schema.my_table USING btree (m_track_id)

EDIT 2:

I wonder why parallel workers are not showing up in my execution plan. I have these settings configured:

max_worker_processes = 6;
max_parallel_workers = 6;
max_parallel_workers_per_gather = 3;
Superman
gozluklu_marti
  • @a_horse_with_no_name I updated my question. I hope I managed to do it as you asked. – gozluklu_marti Mar 26 '20 at 14:16
  • 4
    It seems your `work_mem` is to small to efficiently process the bitmap index scan (that's why you have so many lossy heap blocks). You can try to increase that substantially and see how if that improves the performance. Changing the definition of `my_table_idx` to `(other_table_id, city, timestamp)` might also help (the timestamp column in there won't really help anyway) –  Mar 26 '20 at 14:26
  • @a_horse_with_no_name do you think `btree` is the best index type for this kind of data? – oxyt Mar 27 '20 at 08:45
  • 1
    You provided some good information. But essentials are missing: Postgres version, table definition (`CREATE TABLE` statement). Consider instructions here: https://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Apr 02 '20 at 00:41
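The two suggestions from the comments above can be sketched as SQL (the `work_mem` value is purely illustrative, and the index name is taken from the question; this is untested against the actual schema):

```sql
-- Give the bitmap heap scan enough memory to keep its blocks "exact"
-- instead of going lossy (the plan shows lossy=615808). Session-level only.
SET work_mem = '256MB';

-- Reordered index as suggested in the comments: the equality columns first,
-- timestamp last so the leading columns match the WHERE clause.
DROP INDEX my_schema.my_table_idx;
CREATE INDEX my_table_idx ON my_schema.my_table
    USING btree (other_table_id, city, timestamp);
```

After changing either setting, re-running `EXPLAIN (ANALYZE, BUFFERS)` on the same query shows whether the lossy heap blocks disappear.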

2 Answers


Based on this line from the plan:

Cond: ((m_other_table_id = '14b713d5-fb1a-4dbd-c013-fat4a7f6c8e3'::uuid) AND (m_city_id = 3))

your ideal index should be on (m_other_table_id, m_city_id). Not other_table_id, not city. The plan you showed doesn't really match the query, so it is hard to tell where the typos are - in the shown plan or in the shown query.

Also, since you order by timestamp, we can add it to the index.

So, I would try the following index:

CREATE INDEX idx ON my_schema.my_table USING btree 
    (m_other_table_id, m_city_id, timestamp DESC)

The order of columns here is important.
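To check that the planner actually uses the new index for the question's query shape, one can run the query directly with `EXPLAIN` (literal values below are placeholders copied from the posted plan; the `LIMIT` stands in for `setMaxResults`):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_schema.my_table
WHERE m_other_table_id = '14b713d5-fb1a-4dbd-c013-fat4a7f6c8e3'::uuid
  AND m_city_id = 3
ORDER BY timestamp DESC
LIMIT 20;
```

With the index above, the plan should show an index scan and no separate Sort node: within a fixed `(m_other_table_id, m_city_id)` prefix the index already delivers rows in `timestamp DESC` order, so the `LIMIT` can stop after 20 rows instead of sorting ~890k.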

Vladimir Baranov

Do you need to retrieve 1/2 billion tuples? I doubt it.

Perhaps the question is: are you planning to process this result in your program to do something else? Perhaps you can push that work to the DBMS instead so you only receive the tuples you need.

Regarding your query: I believe the problem is that you are joining by one attribute and sorting by another.

Edit: I looked at your explain. The explain has a LIMIT. Your query does not.

So you are joining EVERY SINGLE TUPLE to find the ones with maximum timestamps?

Do the following:

Write your query with a subquery, where you retrieve the distinct timestamps, sort them, and keep the N most recent.

Then search for joined tuples that have this timestamp. Something like this:

SELECT *
FROM a
JOIN b ON (whatever join condition)
WHERE timestamp IN (SELECT DISTINCT timestamp
                    FROM ...
                    ORDER BY timestamp DESC
                    LIMIT 100)
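Applied to the question's table, the pattern could look roughly like this (column names are assumed from the posted plan, `:id`/`:cities` are bind parameters, and the `LIMIT 100` is an arbitrary cutoff; untested):

```sql
SELECT m.*
FROM my_schema.my_table m
WHERE m.m_other_table_id = :id
  AND m.m_city_id IN (:cities)
  AND m.timestamp IN (SELECT DISTINCT timestamp
                      FROM my_schema.my_table
                      WHERE m_other_table_id = :id
                      ORDER BY timestamp DESC
                      LIMIT 100)
ORDER BY m.timestamp DESC;
```

Note that `LIMIT 100` here bounds the number of distinct timestamps, not the number of result rows, so this is only equivalent to `setMaxResults` when timestamps are unique per group.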
dmg