0

I found strange thing, the query:

SELECT *
FROM progress as pp
ALL LEFT JOIN links as ll USING (viewId)
WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8' 

Result: 0 rows in set. Elapsed: 5.267 sec. Processed 8.62 million rows, 484.94 MB (1.64 million rows/s., 92.08 MB/s.)

Here modified query:

SELECT *
FROM
  (SELECT *
   FROM progress
   WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8') AS p ALL
LEFT JOIN
  (SELECT *
   FROM links
   WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')) AS l ON p.viewId = l.viewId;

Result : 0 rows in set. Elapsed: 0.076 sec. Processed 4.48 million rows, 161.35 MB (58.69 million rows/s., 2.12 GB/s.)

But it looks dirty.

Isn't it supposed to optimize the query considering where condition?

What is the right way to write the query here, and what if it will be where in?

Then I try to add another join:

SELECT *
FROM
  (SELECT videoUuid AS contentUuid,
          viewId
   FROM
     (SELECT *
      FROM progress
      WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8') p ALL
   LEFT JOIN
     (SELECT *
      FROM links
      WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')) USING `viewId`) ALL
LEFT JOIN `metaInfo` USING `viewId`,
                           `contentUuid`;

The result again very slow, considering that I want just join 3 tables with condition selection one row:

0 rows in set. Elapsed: 1.747 sec. Processed 9.13 million rows, 726.55 MB (5.22 million rows/s., 415.85 MB/s.)

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
ogbofjnr
  • 1,688
  • 5
  • 19
  • 41

3 Answers3

6

At this moment the CH not very good cope with multi-joins queries (DB star-schema) and the query optimizer not good enough to rely on it completely.

So it needs to explicitly say how to 'execute' a query by using subqueries instead of joins.

Consider the test query:

SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
  INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
  INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
  INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
WHERE r = 54320
/*
┌─────r─┐
│ 54320 │
└───────┘

1 rows in set. Elapsed: 6.261 sec. Processed 96.06 million rows, 768.52 MB (15.34 million rows/s., 122.74 MB/s.)
*/

Let's rewrite it using subqueries to significantly speed it up.

SELECT number AS r
FROM numbers(87654321)
WHERE r = 54320 AND number IN (
  SELECT number AS r
  FROM numbers(7654321)
  WHERE r = 54320 AND number IN (
    SELECT number AS r
    FROM numbers(654321)
    WHERE r = 54320 AND number IN (
      SELECT number AS r
      FROM numbers(54321)
      WHERE r = 54320
    )
  )
)
/*
┌─────r─┐
│ 54320 │
└───────┘

1 rows in set. Elapsed: 0.481 sec. Processed 96.06 million rows, 768.52 MB (199.69 million rows/s., 1.60 GB/s.)
*/

There are other ways to optimize JOIN:


Some useful refs:

Altinity webinar: Tips and tricks every ClickHouse user should know

Altinity webinar: Secrets of ClickHouse Query Performance

vladimir
  • 13,428
  • 2
  • 44
  • 70
0

Isn't it supposed to optimize the query concidering where condition?

Such optimization is not implemented yet

Denny Crane
  • 11,574
  • 2
  • 19
  • 30
0

It is expected behavior. According to CH doc https://clickhouse.tech/docs/en/sql-reference/statements/select/join/#performance "When running a JOIN, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in WHERE and before aggregation."