Divide ClickHouse table into distinct logical chunks for parallel reading

Question

I want to read a ClickHouse table in parallel from multiple clients. The table is unpartitioned and uses the MergeTree engine. My application is in C++ and is communicating with ClickHouse over its C++ client through SQL queries. The overall idea is to have multiple threads that send different queries to ClickHouse. Each query should get a distinct subset of the table.

I have tried using row_number() OVER() and rowNumberInAllBlocks() but they lead to non-deterministic results throughout different invocations. For example, I tried to get the max row number through SELECT MAX(rowNumberInAllBlocks()) from my_table, and then dividing it into multiple bounds, one for each thread. After defining the bounds, each thread sends a query like:

SELECT rowNumberInAllBlocks() as row_no,* FROM my_table WHERE row_no >= $lower_bound AND row_no < $upper_bound

However, it seems that ClickHouse assigns each tuple to a different row number each time, and hence I can not use this method to deterministically divide my table into different distinct subsets. From the ClickHouse documentation I read:

rowNumberInAllBlocks() Returns the ordinal number of the row in the data block. This function only considers the affected data blocks.

What does "affected data blocks" mean in this case? In PostgreSQL the desired functionality works well with the system column ctid. Is there any way to achieve the desired behavior with ClickHouse?

Good catch, but I believe this is really ClickHouse specific... — harry_g, May 30 '23 at 16:13
Clickhouse is OLAP DB. You should not want to read ClickHouse table in parallel from multiple clients — Denny Crane, Jun 07 '23 at 19:00
Thanks a lot for the answer. I have an ETL scenario. Let's say I want to read my clicklogs from ClickHouse into Tensorflow to train a model. I want to extract the data as fast as possible, hence the multithreaded client. Any suggestions for that? — harry_g, Jun 07 '23 at 21:59

Divide ClickHouse table into distinct logical chunks for parallel reading

0 Answers0