I want to read a ClickHouse table in parallel from multiple clients. The table is unpartitioned and uses the MergeTree engine. My application is in C++ and is communicating with ClickHouse over its C++ client through SQL queries. The overall idea is to have multiple threads that send different queries to ClickHouse. Each query should get a distinct subset of the table.
I have tried using row_number() OVER()
and rowNumberInAllBlocks()
but they lead to non-deterministic results throughout different invocations. For example, I tried to get the max row number through SELECT MAX(rowNumberInAllBlocks()) from my_table
, and then dividing it into multiple bounds, one for each thread. After defining the bounds, each thread sends a query like:
SELECT rowNumberInAllBlocks() as row_no,* FROM my_table WHERE row_no >= $lower_bound AND row_no < $upper_bound
However, it seems that ClickHouse assigns each tuple to a different row number each time, and hence I can not use this method to deterministically divide my table into different distinct subsets. From the ClickHouse documentation I read:
rowNumberInAllBlocks() Returns the ordinal number of the row in the data block. This function only considers the affected data blocks.
What does "affected data blocks" mean in this case? In PostgreSQL the desired functionality works well with the system column ctid
. Is there any way to achieve the desired behavior with ClickHouse?