How can I ensure the mutations I submit to ClickHouse actually run?
I have a fact table, let's call it Table A. It contains 8'904'390 records.
I also have a metadata table, Table B. It contains 14'790'739 records.
I am interested in joining them to make a faster table. Therefore I create a new table, A+B, with selected columns from A and selected columns from B.
When I populate the new table with historical data, I run out of memory executing a simple INSERT INTO statement similar to this one:
INSERT INTO A+B
SELECT
    id,
    time,
    price,
    buyer.gender AS buyer_gender
FROM
(
    SELECT
        id,
        time,
        buyer_ref,
        price
    FROM A
)
LEFT JOIN
(
    SELECT
        ref,
        gender
    FROM B
) AS buyer ON buyer_ref = ref
So instead of manually partitioning the data to insert, I found this clever solution:
Updating rows with values from another table in ClickHouse
The idea is to first populate the A+B table with only data from A, then update it with data from B.
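The first step is cheap since it needs no join. A sketch (I'm assuming buyer_gender is a String column that can start out empty):

-- Step 1: fill A+B from A only, leaving buyer_gender empty for now
INSERT INTO A+B
SELECT
    id,
    time,
    price,
    '' AS buyer_gender
FROM A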
For the second step, I tried running:
ALTER TABLE A+B UPDATE buyer_gender = joinGet('buyer_join', 'gender', buyer_ref) WHERE buyer_gender = ''
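For context: buyer_join is a table with ENGINE = Join built from B, since joinGet only works against Join-engine tables. Roughly like this (the column types here are my assumption):

-- Sketch of the Join table backing joinGet; types are assumed
CREATE TABLE buyer_join
(
    ref UInt64,
    gender String
)
ENGINE = Join(ANY, LEFT, ref)

INSERT INTO buyer_join SELECT ref, gender FROM B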
The ALTER executed and a record was added to the system.mutations table. I went home, came back 24 hours later, and it had not finished: is_done was still 0 and the A+B table still had no data from B in the buyer_gender column.
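To check on it, I query system.mutations, roughly like this:

-- Only the columns relevant to mutation progress
SELECT mutation_id, command, create_time, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE table = 'A+B'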
The ClickHouse logs are pretty quiet:
2021.02.12 11:16:03.631674 [ 153 ] {0e304bff-8563-4219-baf9-27294a84feac} <Debug> executeQuery: (from 172.22.0.1:52862) ALTER TABLE A+B UPDATE buyer_gender=joinGet('buyer_join', 'gender', buyer_id) where gender=''
2021.02.12 11:16:03.641182 [ 153 ] {0e304bff-8563-4219-baf9-27294a84feac} <Information> db_name.A+B: Added mutation: mutation_216.txt
2021.02.12 11:16:03.641210 [ 153 ] {0e304bff-8563-4219-baf9-27294a84feac} <Information> db_name.A+B: Waiting mutation: mutation_216.txt
2021.02.12 11:17:35.873119 [ 154 ] {14b270a0-281a-4d0a-b3be-4ce5059d853d} <Debug> executeQuery: (from 172.22.0.1:36140) select * from system.mutations
2021.02.12 11:17:35.873626 [ 154 ] {14b270a0-281a-4d0a-b3be-4ce5059d853d} <Trace> ContextAccess (default): Access granted: SELECT(database, table, mutation_id, command, create_time, `block_numbers.partition_id`, `block_numbers.number`, parts_to_do_names, parts_to_do, is_done, latest_failed_part, latest_fail_time, latest_fail_reason) ON system.mutations
2021.02.12 11:17:39.557393 [ 154 ] {fdabc8d1-488c-4b47-a35e-afd80f288fa6} <Debug> executeQuery: (from 172.22.0.1:36140) select * from system.mutations Format Vertical
2021.02.12 11:17:39.557868 [ 154 ] {fdabc8d1-488c-4b47-a35e-afd80f288fa6} <Trace> ContextAccess (default): Access granted: SELECT(database, table, mutation_id, command, create_time, `block_numbers.partition_id`, `block_numbers.number`, parts_to_do_names, parts_to_do, is_done, latest_failed_part, latest_fail_time, latest_fail_reason) ON system.mutations
And the ClickHouse error logs contain nothing related to this mutation.
I tried a plain SELECT from table A that pulls just one column from the join table using the joinGet above. It takes around 10 s.
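That test looked roughly like this (exact column list approximated):

-- Plain SELECT pulling one column via joinGet
SELECT
    id,
    joinGet('buyer_join', 'gender', buyer_ref) AS buyer_gender
FROM A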
I also tried an ALTER TABLE UPDATE statement that assigns the buyer_gender column a constant string, 'foo'.
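Something like:

-- Mutation assigning a constant value
ALTER TABLE A+B UPDATE buyer_gender = 'foo' WHERE buyer_gender = ''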
It also executes very quickly, and neither operation seems to use significant memory. But combining the joinGet with the ALTER TABLE UPDATE seems to never finish, and I have no idea whether it even started.
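I could presumably watch system.merges, where running mutations show up with is_mutation = 1, but I am not sure that would tell me whether this one ever started:

-- Check whether the mutation is actively rewriting parts
SELECT database, table, elapsed, progress, is_mutation
FROM system.merges
WHERE table = 'A+B'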
So my question is: how does one get a submitted mutation to actually run? And if that cannot be influenced, is over 24 hours normal for this operation and this volume of data?
ClickHouse server version 20.6.5 revision 54436