
I have two tables: one is migrate_data, which stores product details, and the other is ws_data, which stores d_id and parent (the parent of d_id), where d_id is the process id.

I have millions of records and want to join these two tables with a left join, building the child, parent and grandparent levels from d_id and parent.

Dummy Sample data (Generation):

CREATE TABLE ws_data (
  id SERIAL UNIQUE NOT NULL,
  d_id integer,
  parent integer,
  CONSTRAINT ws_data_pk
   PRIMARY KEY (id)
);

CREATE TABLE migrate_data (
  id SERIAL UNIQUE NOT NULL,
  d_id_sell integer,
  d_id_curr integer,
  product_name VARCHAR(100) NOT NULL, -- not unique
  CONSTRAINT migrate_data_pk
   PRIMARY KEY (id)
);


insert into migrate_data (
    product_name,
    d_id_sell,
    d_id_curr
)
select
    md5(random()::text),
    floor(random() * (1000 + 1)),
    floor(random() * (1000 + 1))
from generate_series(1, 10000);

insert into ws_data (
    parent,
    d_id
)
select
    floor(random() * (1000 + 1)),
    floor(random() * (1000 + 1))
from generate_series(1, 10000);

DELETE FROM ws_data T1
WHERE   T1.d_id = T1.parent;  

DELETE FROM ws_data T1
    USING   ws_data T2
WHERE   T1.id < T2.id  -- delete the older versions
    AND T1.parent = T2.parent AND T1.d_id = T2.d_id;  -- add more columns if needed

Main Query:

explain (analyze, buffers, format text)
select a.d_id_sell, a.d_id_curr, a.product_name
from migrate_data a
    left join ws_data t1_parent on t1_parent.d_id = a.d_id_sell
    left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent
    left join ws_data t2_parent on t2_parent.d_id = a.d_id_curr
    left join ws_data t2_grandparent on t2_grandparent.d_id = t2_parent.parent;

Analysis:

"Hash Right Join  (cost=36223.20..1292695.17 rows=97566770 width=41) (actual time=224.185..4777.545 rows=96825218 loops=1)"
"  Hash Cond: (t2_parent.d_id = a.d_id_current)"
"  Buffers: shared hit=314, temp read=7415 written=7415"
"  ->  Hash Right Join  (cost=280.00..1740.20 rows=99270 width=4) (actual time=1.110..6.031 rows=98682 loops=1)"
"        Hash Cond: (t2_grandparent.d_id = t2_parent.parent)"
"        Buffers: shared hit=110"
"        ->  Seq Scan on ws_data t2_grandparent  (cost=0.00..155.00 rows=10000 width=4) (actual time=0.002..0.360 rows=9944 loops=1)"
"              Buffers: shared hit=55"
"        ->  Hash  (cost=155.00..155.00 rows=10000 width=8) (actual time=1.059..1.059 rows=9944 loops=1)"
"              Buckets: 16384  Batches: 1  Memory Usage: 517kB"
"              Buffers: shared hit=55"
"              ->  Seq Scan on ws_data t2_parent  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.008..0.470 rows=9944 loops=1)"
"                    Buffers: shared hit=55"
"  ->  Hash  (cost=14914.56..14914.56 rows=987731 width=41) (actual time=221.897..221.898 rows=981173 loops=1)"
"        Buckets: 65536  Batches: 32  Memory Usage: 2891kB"
"        Buffers: shared hit=204, temp written=7086"
"        ->  Hash Right Join  (cost=599.00..14914.56 rows=987731 width=41) (actual time=9.566..75.486 rows=981173 loops=1)"
"              Hash Cond: (t1_parent.d_id = a.d_id_seller)"
"              Buffers: shared hit=204"
"              ->  Hash Right Join  (cost=280.00..1740.20 rows=99270 width=4) (actual time=3.302..10.555 rows=98682 loops=1)"
"                    Hash Cond: (t1_grandparent.d_id = t1_parent.parent)"
"                    Buffers: shared hit=110"
"                    ->  Seq Scan on ws_data t1_grandparent  (cost=0.00..155.00 rows=10000 width=4) (actual time=0.007..0.508 rows=9944 loops=1)"
"                          Buffers: shared hit=55"
"                    ->  Hash  (cost=155.00..155.00 rows=10000 width=8) (actual time=3.256..3.257 rows=9944 loops=1)"
"                          Buckets: 16384  Batches: 1  Memory Usage: 517kB"
"                          Buffers: shared hit=55"
"                          ->  Seq Scan on ws_data t1_parent  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.018..1.623 rows=9944 loops=1)"
"                                Buffers: shared hit=55"
"              ->  Hash  (cost=194.00..194.00 rows=10000 width=41) (actual time=6.223..6.224 rows=10000 loops=1)"
"                    Buckets: 16384  Batches: 1  Memory Usage: 841kB"
"                    Buffers: shared hit=94"
"                    ->  Seq Scan on migrate_data a  (cost=0.00..194.00 rows=10000 width=41) (actual time=0.034..3.721 rows=10000 loops=1)"
"                          Buffers: shared hit=94"
"Planning Time: 0.701 ms"

I am trying to reduce the query execution time. I was reading about LATERAL joins, but I don't know how to use them here, and I am not able to figure out how this left join can be improved.
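
This is roughly how I understand a LATERAL rewrite of the same joins would look (untested sketch; since the query still fetches every row, I am not sure it changes the plan at all):

select a.d_id_sell, a.d_id_curr, a.product_name
from migrate_data a
    left join lateral (
        select w.parent from ws_data w where w.d_id = a.d_id_sell
    ) t1_parent on true
    left join lateral (
        select w.parent from ws_data w where w.d_id = t1_parent.parent
    ) t1_grandparent on true
    left join lateral (
        select w.parent from ws_data w where w.d_id = a.d_id_curr
    ) t2_parent on true
    left join lateral (
        select w.parent from ws_data w where w.d_id = t2_parent.parent
    ) t2_grandparent on true;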

Any lead would be appreciated.

EDIT:

| grandparent | parent | child |
|-------------|--------|-------|
| 50          | 300    | 101   |
| 50          | 300    | 345   |
| 50          | 300    | 356   |
| 50          | 300    | 456   |
| 50          | 150    | 458   |
| 50          | 150    | 546   |
| 50          | 200    | 547   |
| 50          | 200    | 600   |
| 50          | 200    | 601   |
| 80          |        |       |
| 80          |        |       |
| 90          | 500    |       |
| 90          | 501    |       |
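
For clarity, those hierarchy columns map onto the joins in the main query roughly like this (a sketch for the d_id_sell side only; child/parent/grandparent are just column aliases):

select a.d_id_sell           as child,
       t1_parent.parent      as parent,
       t1_grandparent.parent as grandparent
from migrate_data a
    left join ws_data t1_parent on t1_parent.d_id = a.d_id_sell
    left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent;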
  • @a_horse_with_no_name updated the post. I have trimmed the columns `seller` to `sell` and `current` to `curr` in the query. – Pygirl Nov 03 '22 at 15:36
  • rows=96825218, in just 4 seconds? That's impressive. What kind of performance are you aiming for? – Frank Heikens Nov 03 '22 at 15:38
  • @FrankHeikens actually I reduced the generated data from `1000000` to `10000` as it was taking time. so just to check the logic I reduced the number. but in real scenario I will have lots of data and that query is almost similar to this one just the `ws_data` will have more columns (related to details of a product and also having `coalesce() as <>` in select. With real data I was getting timed out so was trying to optimize the query on the generated data for testing as I am not having access to the data. – Pygirl Nov 03 '22 at 15:39
  • Does it make sense that each child has about 10 parents and 100 grandparents? – jjanes Nov 03 '22 at 16:37
  • @jjanes Actually it will be opposite child id will always be unique. `grandparent id -> parent id -> child id`. More no of unique id in child and then in parent and then in grandparent say total no of unique id(4000-childs,400-parents,50 grandparents) . I have added the example. child won't have many parents. – Pygirl Nov 03 '22 at 17:01
  • This is mostly a volume thing, there isn't much wrong with the query. If you need faster output you could try to split the fetches by e.g. `WHERE a.id BETWEEN 1 AND 10000` etc... that way the sorting and JOINing can happen all in RAM. You'll get results 'sooner' but you'll need to repeat and merge things on the client-side again. I'm guessing that the bigger the blocks are the faster things will be initially but you'll quickly find that once you hit a given threshold then performance will tank. Find the best block-size and make sure to leave some buffer. You'll also avoid time-outs like this. – deroby Nov 04 '22 at 18:07
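
A rough sketch of the block-wise fetching deroby suggests above (the range bounds and block size are placeholders; the client loops over successive id ranges and merges the results):

-- one block; repeat with the next id range (10001-20000, ...) on the client side
select a.d_id_sell, a.d_id_curr, a.product_name
from migrate_data a
    left join ws_data t1_parent on t1_parent.d_id = a.d_id_sell
    left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent
    left join ws_data t2_parent on t2_parent.d_id = a.d_id_curr
    left join ws_data t2_grandparent on t2_grandparent.d_id = t2_parent.parent
where a.id between 1 and 10000;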

1 Answer


With a query that retrieves all data from the join, there is not really a lot you can do. The biggest savings can be had from increasing work_mem until no more temp files are written.
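
A minimal sketch of that tuning step, using the query from the question (the 256MB value is only a placeholder to experiment with; size it to the available RAM and the number of concurrent sessions):

-- session-level setting; the plan above spills (Batches: 32, temp read/written),
-- and the goal is to raise work_mem until that temp I/O disappears
set work_mem = '256MB';

explain (analyze, buffers, format text)
select a.d_id_sell, a.d_id_curr, a.product_name
from migrate_data a
    left join ws_data t1_parent on t1_parent.d_id = a.d_id_sell
    left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent
    left join ws_data t2_parent on t2_parent.d_id = a.d_id_curr
    left join ws_data t2_grandparent on t2_grandparent.d_id = t2_parent.parent;

-- in the new plan every Hash node should show "Batches: 1" and the Buffers
-- lines should no longer report any "temp read" or "temp written"

reset work_mem;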

Laurenz Albe
  • Oh then I can't improve this joining part but this `coalesce(a.type, '') type` in the select can be optimize? – Pygirl Nov 03 '22 at 16:27
  • 1
    No, that is typically way cheaper than the rest of the query. If the cost is significant, PostgreSQL can employ its JIT compiler to speed it up. – Laurenz Albe Nov 03 '22 at 17:03