0

I tried to use clickhouse to store 4 billion data, deployed on a single machine, 48-core cpu and 256g memory, mechanical hard disk.

My data has ten columns, and I want to quickly search any column through SQL statements, such as:

select * from table where key='mykeyword'; or select * from table where school='Yale';

I use order by to establish a sort key, order by (key, school, ...) But when I search, only the first field ordered by key has very high performance. When searching for other fields, the query speed is very slow or even memory overflow (the memory allocation is already large enough)

So ask every expert, does clickhouse support such high-performance search for each column index similar to mysql? I also tried to create a secondary index for each column through index, but the performance did not improve.

datagic
  • 25
  • 7
  • clickhouse is not designed for such queries. Try Cassandra. – Denny Crane Jan 02 '22 at 16:58
  • @datagic, what you have to understand about clickhouse is that it is designed to do aggregations/"complex computations" in a very fast way, it stores the data by partition, and orders them by certains columns to optimise storage for the queries. The main bottleneck is that it will always read the full partition in the order provided by the order by clause that you specified at the creation of the table. So if you need just to read and filter the data by doing it on every column you want and don't have a discriminating column that can be used as partition, you don't have the right use case. – arcticless Jan 31 '22 at 14:39
  • but if you manage to correctly split it by partitions and manage to read the less partitions you can in a query it should do the job just fine. Also if you have memory overflow it means you're doing some kind of aggregation on it, like group by/order by for ex. with a setting of distributed_group_by_no_merge=0 which will gather all your data in one node to do the group/sort. in order to avoid that, you could use this setting set at 1, which will do the aggregation by shard, or 2 which can be used in case your data doesn't have collisions between shards. – arcticless Jan 31 '22 at 14:48

1 Answers1

1

You should try to understand how works sparse primary indexes and how exactly right ORDER BY clause in CREATE TABLE help your query performance.

Clickhouse never will works the same way as mysql

Try to use PRIMARY KEY and ORDER BY in CREATE TABLE statement and use fields with low value cardinality on first order in PRIMARY KEY don't try to use ALL

SELECT * ...

it's really antipattern

moreover, maybe secondary data skip index may help you (but i'm not sure) https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes

Slach
  • 1,672
  • 13
  • 21