We are using the recommended method defined here for performing "upserts": http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html
It is taking almost two minutes to load a file of just 150 rows. Almost all of this time is spent on this delete operation:
delete from measurement
using measurement_temp
where measurement.measurement_tag_id = measurement_temp.measurement_tag_id
  and measurement.date_time = measurement_temp.date_time;
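For context, that delete is one step of the documented merge pattern; my full load sequence looks roughly like this (simplified sketch, with the staging-table setup and COPY options elided):

```sql
begin transaction;

-- Stage the incoming file into the temp table (COPY options elided):
-- copy measurement_temp from 's3://<bucket>/<file>' ...;

-- Remove existing rows that the incoming batch replaces.
delete from measurement
using measurement_temp
where measurement.measurement_tag_id = measurement_temp.measurement_tag_id
  and measurement.date_time = measurement_temp.date_time;

-- Insert the replacement rows.
insert into measurement
select * from measurement_temp;

end transaction;
```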
Even if the temp table is empty, this operation still takes nearly two minutes to complete. It is just as slow after running a full VACUUM on all tables.
Both columns in the predicate are part of the compound sort key, and measurement_tag_id is the distribution key, so it's not clear to me why Redshift is taking so long.
The schema for this table looks like this:
create table measurement(
measurement_tag_id integer not null distkey,
date_time timestamp not null,
value_avg decimal(8,3),
value_min decimal(8,3),
value_max decimal(8,3),
value_std_dev decimal(8,3),
failed_qa_rule_id integer,
primary key(measurement_tag_id, date_time),
foreign key (measurement_tag_id) references measurement_tag(measurement_tag_id),
foreign key (failed_qa_rule_id) references qa_rule(qa_rule_id)
)
compound sortkey(measurement_tag_id, date_time);
Here is the query plan for the DELETE:
XN Hash Join DS_DIST_NONE (cost=9.30..3368957513.17 rows=451945 width=6)
Hash Cond: (("outer".date_time = "inner".date_time) AND ("outer".measurement_tag_id = "inner".measurement_tag_id))
-> XN Seq Scan on measurement (cost=0.00..26844282.88 rows=2684428288 width=18)
-> XN Hash (cost=6.20..6.20 rows=620 width=12)
-> XN Seq Scan on measurement_temp (cost=0.00..6.20 rows=620 width=12)
The equivalent SELECT returns almost instantly. Here's its query plan:
explain
select * from measurement as m
join measurement_temp as mt on m.measurement_tag_id = mt.measurement_tag_id and m.date_time = mt.date_time
XN Merge Join DS_DIST_NONE (cost=0.00..40266436.05 rows=451945 width=144)
Merge Cond: (("outer".measurement_tag_id = "inner".measurement_tag_id) AND ("outer".date_time = "inner".date_time))
-> XN Seq Scan on measurement m (cost=0.00..26844282.88 rows=2684428288 width=68)
-> XN Seq Scan on measurement_temp mt (cost=0.00..6.20 rows=620 width=76)
So the DELETE is performing a Hash Join, while the SELECT uses a Merge Join, which is much faster.
Any ideas how to speed this up? Most of the time there is nothing to delete (and the DELETE is still this slow), so I could run a SELECT first and skip the DELETE when it isn't needed, but it will still be slow whenever I re-load over existing data.
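One workaround I'm considering (untested sketch, following the general advice to constrain a merge DELETE with predicates on the sort key) is to limit the scan of measurement to the date_time range actually present in the temp table:

```sql
delete from measurement
using measurement_temp
where measurement.measurement_tag_id = measurement_temp.measurement_tag_id
  and measurement.date_time = measurement_temp.date_time
  -- Restrict the scan to the sort-key range covered by the incoming batch,
  -- so Redshift can skip blocks entirely outside that range.
  and measurement.date_time >= (select min(date_time) from measurement_temp)
  and measurement.date_time <= (select max(date_time) from measurement_temp);
```

As a side effect, when measurement_temp is empty the min/max subqueries return NULL, the range predicates match nothing, and the delete should be a near-instant no-op, which would also cover the common empty-batch case.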