6

Is it possible with Clickhouse to have result containing a pair of array transformed as columns?

Form this result:

┌─f1──┬f2───────┬f3─────────────┐
│ 'a' │ [1,2,3] │ ['x','y','z'] │
│ 'b' │ [4,5,6] │ ['x','y','z'] │
└─────┴─────────┴───────────────┘

to :

┌─f1──┬x──┬y──┬z──┐
│ 'a' │ 1 │ 2 │ 3 │
│ 'b' │ 4 │ 5 │ 6 │
└─────┴───┴───┴───┘

The idea is to not have to repeat the header values for each line.

In my case, the "header" array f3 unique by queries and join to the f1,f2.

crak
  • 1,635
  • 2
  • 17
  • 33

2 Answers2

9

You can do it with help of indexOf function.

SELECT *
FROM test_sof 

┌─f1─┬─f2──────┬─f3────────────┐
│ a  │ [1,2,3] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
┌─f1─┬─f2────────┬─f3────────────────┐
│ c  │ [7,8,9,0] │ ['x','y','z','n'] │
└────┴───────────┴───────────────────┘
┌─f1─┬─f2─────────┬─f3────────────────┐
│ d  │ [7,8,9,11] │ ['x','y','z','n'] │
└────┴────────────┴───────────────────┘
┌─f1─┬─f2──────┬─f3────────────┐
│ b  │ [4,5,6] │ ['x','y','z'] │
└────┴─────────┴───────────────┘

4 rows in set. Elapsed: 0.001 sec.

Then:

SELECT 
    f1, 
    f2[indexOf(f3, 'x')] AS x, 
    f2[indexOf(f3, 'y')] AS y, 
    f2[indexOf(f3, 'z')] AS z, 
    f2[indexOf(f3, 'n')] AS n
FROM test_sof 
ORDER BY 
    f1 ASC, 
    x ASC

┌─f1─┬─x─┬─y─┬─z─┬──n─┐
│ a  │ 1 │ 2 │ 3 │  0 │
│ b  │ 4 │ 5 │ 6 │  0 │
│ c  │ 7 │ 8 │ 9 │  0 │
│ d  │ 7 │ 8 │ 9 │ 11 │
└────┴───┴───┴───┴────┘

4 rows in set. Elapsed: 0.002 sec. 

Keep in mind situation when index from header array will not be present in data array or vise-versa.

UPD: the way how to get data without knowing "headers".

You will get three columns, third one with headers.

SELECT 
    f1, 
    f2[num] AS f2_el, 
    f3[num] AS f3_el
FROM test_sof 
ARRAY JOIN arrayEnumerate(f2) AS num
ORDER BY f1 ASC

┌─f1─┬─f2_el─┬─f3_el─┐
│ a  │     1 │ x     │
│ a  │     2 │ y     │
│ a  │     3 │ z     │
│ b  │     4 │ x     │
│ b  │     5 │ y     │
│ b  │     6 │ z     │
│ c  │     7 │ x     │
│ c  │     8 │ y     │
│ c  │     9 │ z     │
│ c  │     0 │ n     │
│ d  │     7 │ x     │
│ d  │     8 │ y     │
│ d  │     9 │ z     │
│ d  │    11 │ n     │
└────┴───────┴───────┘

14 rows in set. Elapsed: 0.006 sec.
Victor Perov
  • 1,697
  • 18
  • 37
  • It's the solution I have implemented, but In my case, it requires a first request to find the array headers. But since I have the array I can avoid indexOf and use directly the index. Despite it's a working solution, I think it's not the ideal "full in clickhouse" soluton. – crak Feb 25 '19 at 12:59
  • okay, I got your question now. I would like update my answer a bit later, because I remembered one more way – Victor Perov Feb 25 '19 at 13:13
  • But if you want split your "headers" as columns in resultset, you have to know all of them, otherwise it is impossible. The way, how to do it without requesting headers will return headers as values of the "header row" – Victor Perov Feb 25 '19 at 15:09
  • "It is impossible" is a valid answer. – crak Feb 26 '19 at 14:01
0

This a fun puzzle. As pointed out already the indexOf() function seems to be the best way to pivot array columns inside ClickHouse but requires explicit selection of array positions. If you are using Python and your result set is not absurdly large, you can solve the problem in a more general way by flipping the array values into rows in SQL, then pivoting columns f2 and f3 in Python. Here's how it works.

First, use clickHouse-sqlalchemy and pandas to expand the matching arrays into rows as follows. (This example uses Jupyter Notebook running on Anaconda.)

# Load SQL Alchemy and connect to ClickHouse
from sqlalchemy import create_engine
%load_ext sql
%sql clickhouse://default:@localhost/default

# Use JOIN ARRAY to flip corresponding positions in f2, f3 to rows.
result = %sql select * from f array join f2, f3
df = result.DataFrame()
print(df)

The data frame appears as follows:

  f1  f2 f3
0  a   1  x
1  a   2  y
2  a   3  z
3  b   4  x
4  b   5  y
5  b   6  z

Now we can pivot f2 and f3 into a new data frame.

dfp = df.pivot(columns='f3', values='f2', index='f1')
print(dfp)

The new dataframe dfp appears as follows:

f3  x  y  z
f1         
a   1  2  3
b   4  5  6

This solution requires you to work outside the database but has the advantage that it works generally for any set of arrays as long as the names and values match. For instance if we add another row with different values and properties the same code gets the right answer. Here's a new row.

insert into f values ('c', [7,8,9,10], ['x', 'y', 'aa', 'bb'])

The pivoted data frame will appear as follows. NaN corresponds to missing values.

f3   aa    bb    x    y    z
f1                          
a   NaN   NaN  1.0  2.0  3.0
b   NaN   NaN  4.0  5.0  6.0
c   9.0  10.0  7.0  8.0  NaN

For more information on this solution see https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html and https://github.com/xzkostyan/clickhouse-sqlalchemy.

R Hodges
  • 326
  • 1
  • 2