
I have two huge tables:

Table "public.tx_input1_new" (100,000,000 rows) 

     Column     |            Type             | Modifiers
----------------|-----------------------------|----------
 blk_hash       | character varying(500)      |
 blk_time       | timestamp without time zone |
 tx_hash        | character varying(500)      |
 input_tx_hash  | character varying(100)      |
 input_tx_index | smallint                    |
 input_addr     | character varying(500)      |
 input_val      | numeric                     |

Indexes:
    "tx_input1_new_h" btree (input_tx_hash, input_tx_index) 

Table "public.tx_output1_new" (100,000,000 rows)

    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 tx_hash      | character varying(100) |
 output_addr  | character varying(500) |
 output_index | smallint               |
 input_val    | numeric                |

Indexes:
    "tx_output1_new_h" btree (tx_hash, output_index)

I want to update the first table from the second:

UPDATE tx_input1 as i
SET 
  input_addr = o.output_addr,
  input_val = o.output_val
FROM tx_output1 as o
WHERE 
  i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;

Before executing this SQL command, I had already created indexes on these two tables:

CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);

CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);

I used the EXPLAIN command to look at the query plan, but it didn't use the indexes I created.

It took about 14-15 hours to complete this UPDATE.

What is the problem here?

How can I shorten the execution time, or tune my database/table?

Thank you.

Ben H
user3383856
  • Please **[EDIT]** your question and add the execution plan (again: [**formatted text**](http://stackoverflow.com/help/formatting), please) – Mar 02 '17 at 07:28
  • ok. I am sorry. – user3383856 Mar 02 '17 at 07:28
  • Your question is more about: [Why postgres is not using the index?](http://stackoverflow.com/search?q=postgres+why+not+using+index) – McNets Mar 02 '17 at 07:29
  • I have searched this topic, but it didn't help. I think it may not be caused by bad indexing. – user3383856 Mar 02 '17 at 07:50
  • 1) How many different values does `input_tx_hash` contain in either table? 2) How many records will actually change value as a result of this update? – joop Mar 02 '17 at 10:31
  • @joop 1) tx_input1 has about 600,000 distinct tx_hash values, tx_output1 has about 300,000. 2) The columns (input_addr, input_val) are empty in the original table1. – user3383856 Mar 02 '17 at 15:02
  • And `input_addr` and `o.output_addr` are also not unique? (The reason I am asking: your tables seem relatively *fat*, with large text fields of almost-redundant strings. Smaller rows would mean fewer pages to read/write from/to disk.) – joop Mar 02 '17 at 15:11

1 Answer


Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.

First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
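
To see which strategy the planner picks without actually running the statement, you can EXPLAIN the UPDATE itself (EXPLAIN without ANALYZE only plans the statement, it does not execute it); the table and column names below are the ones from the question:

EXPLAIN
UPDATE tx_input1 AS i
SET
  input_addr = o.output_addr,
  input_val = o.output_val
FROM tx_output1 AS o
WHERE
  i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;

You would expect a Hash Join node sitting on top of two sequential scans, as described above.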

You can expect such an operation to take a long time, but there are some ways in which you could speed up the operation:

  • Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE (see the SQL sketch after this list). Depending on the number of indexes on the table, you can expect a decent to massive performance gain.

  • Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up with temporary files, but you can still expect a decent performance gain.

  • Increase checkpoint_segments (or max_wal_size from version 9.5 on) to a high value so that there are fewer checkpoints during the UPDATE operation.

  • Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
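
Put together, a rough sketch of these steps could look like the following; the index and table names are taken from the question, and the memory and WAL figures are placeholders to adapt to your hardware, not recommendations:

-- 1. Drop the index (and any other indexes/constraints) on the target table
--    so they are not maintained row by row during the UPDATE.
DROP INDEX tx_input1_new_h;

-- 2. Let the hash use as much memory as you can afford, for this session only.
SET work_mem = '2GB';

-- 3. Space out checkpoints (PostgreSQL 9.5 or later; needs superuser and a
--    configuration reload to take effect).
ALTER SYSTEM SET max_wal_size = '30GB';
SELECT pg_reload_conf();

-- 4. Refresh the statistics so PostgreSQL can size the hash well.
ANALYZE tx_input1;
ANALYZE tx_output1;

-- ... run the UPDATE here ...

-- 5. Recreate the index afterwards.
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);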

After the UPDATE, if it affects a large number of rows, consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and, as a consequence, speed up sequential scans.
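
If you do that, it could be as simple as the following (table name again taken from the question; keep in mind that VACUUM (FULL) holds an ACCESS EXCLUSIVE lock on the table while it rewrites it):

-- Rewrite the table to reclaim the bloat left behind by the UPDATE.
VACUUM (FULL) tx_input1;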

Laurenz Albe
  • May I ask you how to build the hash structure and do the hash join? – user3383856 Mar 02 '17 at 14:23
  • You don't have to do that, the database does it for you. I just explained to you how the database executes your join so that you understand how to speed it up. – Laurenz Albe Mar 02 '17 at 19:43
  • Wow! Dropping 25 indices, setting `work_mem` from 2 to 4GB, `checkpoint_timeout` from 5min to 1h and `max_wal_size` from 1 to 30GB reduced an update on a 70-million-row table from 23 hours down to 30 minutes. Thanks! – c_froehlich Oct 30 '21 at 08:37
  • Should I also remove the primary key, for massive updates? Or just all the other indexes? – dessalines Jan 20 '23 at 18:25
  • @thouliha The primary key too. – Laurenz Albe Jan 20 '23 at 21:39
  • How do indexes hurt performance if the affected columns are unindexed? It doesn't make sense to me, @LaurenzAlbe – AguThadeus Apr 26 '23 at 17:19
  • @AguThadeus Because of the way multiversioning is implemented. Unless you can make use of [HOT updates](https://www.cybertec-postgresql.com/en/hot-updates-in-postgresql-for-better-performance/), all indexes on the table have to be modified. – Laurenz Albe Apr 27 '23 at 04:11