7

I have two tables

event

  • id
  • os

params

  • id
  • sx
  • sy

This table have relation 1-1 by id. If execute query

select count(*)
from
(select id from event where os like 'Android%')
inner join
(select id from params where sx >= 1024)
using id

they very slow

But if all data contains in one table

select count(*) from event where sx >= 1024 and os like 'Android%'

Query executed very fast.

Please, tell me how use join in ClickHouse DB effective? Keep all data in one table is not convenient.

Oleg Khamov
  • 269
  • 1
  • 3
  • 10

2 Answers2

5

I experience same problems with joining 2 huge distributed tables. There are 2 main problems

  • durion of executing
  • limits by needed memory for a query.

What works for me is sharding calculation query by id%N using subqueries and union all results then.

SELECT count(*)
FROM
(
    SELECT 1
    FROM event
    WHERE id%2=0 AND id IN
    (
        SELECT id
        FROM params
        WHERE id % 2 = 0 AND sx >= 1024
    )
    UNION ALL
    SELECT 2
    FROM event
    WHERE id % 2 = 1 AND id IN
    (
        SELECT id
        FROM params
        WHERE id % 2 = 1 AND sx >= 1024
    )
)

You can change id%N(2 in the example) until you get needed performance. Need to replace IN to GLOBAL IN if you use distributed engines for tables.

3

You may rewrite query like this:

select count(*)
from event 
where os like 'Android%' 
AND id IN (select id from params where sx >= 1024)
uYSIZfoz
  • 911
  • 6
  • 7