
We are trying to scale up Trino queries and are currently failing. We use Trino to query Iceberg data into Dask, in a JupyterLab notebook, running on GKE (Kubernetes). We use Dask to measure Trino performance because SQL client apps return only the first X results and then cancel the query. The Dask configuration stayed the same throughout testing.
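For reference, full-query timing can also be checked without Dask by draining a DB-API cursor, so nothing gets cancelled after the first page of results. A minimal sketch, assuming the `trino` Python client and the same in-cluster endpoint; the schema and table names are illustrative:

```python
# Minimal sketch: time a full Trino query by fetching every row through the
# DB-API cursor, so the query is not cancelled after the first results.
# Assumes the `trino` client; schema/table names are illustrative.
import time
import trino

conn = trino.dbapi.connect(host="trino.trino", port=8080, user="trino",
                           catalog="iceberg", schema="default")
cur = conn.cursor()
start = time.monotonic()
cur.execute("SELECT ifc_core_id_long, min(ifc_core_updatetime) "
            "FROM generated_table GROUP BY ifc_core_id_long")
rows = cur.fetchall()  # drain everything so the query runs to completion
print(f"{len(rows)} rows in {time.monotonic() - start:.1f}s")
```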

We tested two tables:

  1. a fake one with a 1:1 ratio of parent ids to ids, and
  2. a real one with anywhere from 5000 down to 3 ids per parent id, to see how much data skew affects query time (see the sketch after this list)
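The per-parent skew can be checked directly in Trino; a minimal sketch using the same SQLAlchemy URL as the queries below (the fully qualified table name is illustrative):

```python
# Sketch: check the per-parent skew directly in Trino via SQLAlchemy.
# Uses the same connection URL as the queries below; the fully qualified
# table name is illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("trino://trino@trino.trino:8080/iceberg")
with engine.connect() as conn:
    top = conn.execute(text("""
        SELECT ifc_core_parentid_long, count(*) AS n
        FROM iceberg.default.generated_table
        GROUP BY ifc_core_parentid_long
        ORDER BY n DESC
        LIMIT 10
    """)).fetchall()
print(top)  # heaviest parents sit near 5000 children, lightest near 3
```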

We query three columns: id, parentId, and date. We run three types of queries, each taking the min of the update time: (1) grouping by id; (2) grouping by parentId, also taking the min of id; and (3) grouping by id, also taking the min of parentId.

Queries are attached below

We used 4, 8, and 16 Trino workers, and different Iceberg partitioning strategies: bucketing by the id column or by the parent id column (not both together), with 4 or 8 buckets.
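For reference, this is roughly the shape of the bucketed variants, using Trino's Iceberg connector syntax; the column types and schema name are illustrative, not our exact DDL:

```python
# Sketch: declaring a bucketed Iceberg table through Trino. `partitioning`
# is the Trino Iceberg connector's table property; column types and the
# schema name here are illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("trino://trino@trino.trino:8080/iceberg")
with engine.connect() as conn:
    conn.execute(text("""
        CREATE TABLE iceberg.default.generated_table_bucketed (
            ifc_core_id_long       BIGINT,
            ifc_core_parentid_long BIGINT,
            ifc_core_updatetime    VARCHAR
        )
        WITH (partitioning = ARRAY['bucket(ifc_core_id_long, 8)'])
    """))
```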

We see improvement in the fake-table queries from using more Trino workers, but not from partitioning. We don't see any improvement on the skewed (real) data.

We are trying to understand:

  1. why partitioning does not help the query time,
  2. why querying only the id column on the skewed data does not get faster with more Trino workers, and
  3. whether it is correct that more Trino workers will not help with highly skewed data that is not partitioned.
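If it helps the diagnosis, this is how we can capture Trino's per-stage timings for these queries; a sketch using EXPLAIN ANALYZE (the fully qualified table name is illustrative):

```python
# Sketch: capture Trino's EXPLAIN ANALYZE output for Query 1 to see
# per-stage wall time and row counts (useful for spotting skewed stages).
# The fully qualified table name is illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("trino://trino@trino.trino:8080/iceberg")
with engine.connect() as conn:
    plan = conn.execute(text("""
        EXPLAIN ANALYZE
        SELECT ifc_core_id_long, min(ifc_core_updatetime)
        FROM iceberg.default.generated_table
        GROUP BY ifc_core_id_long
    """)).scalar()
print(plan)
```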

Any help would be appreciated.

## Query 1

```python
import dask.dataframe as dd
from dask.distributed import wait
from sqlalchemy import DateTime, cast, func, select

# generated_table is a sqlalchemy Table reflected from the Iceberg catalog.
# Group by id, taking min of the update time. This builds the per-partition
# SQL statements but does not run them until .persist().
df_merch = dd.read_sql_query(select([
    func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
    generated_table.c.ifc_core_id_long
]).group_by(generated_table.c.ifc_core_id_long), "trino://trino@trino.trino:8080/iceberg", index_col="ifc_core_id_long", npartitions=16)
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)
```
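For context on what actually reaches Trino: as far as we understand, `read_sql_query` splits the `index_col` min–max range into `npartitions` slices and appends a range filter to the statement for each slice. Roughly, one of the 16 generated statements looks like this (boundary values are made up):

```python
# Roughly one of the 16 per-partition statements Dask generates for Query 1,
# after appending the index range filter (boundary values are made up):
partition_sql = """
SELECT min(CAST(ifc_core_updatetime AS TIMESTAMP)) AS update_time,
       ifc_core_id_long
FROM generated_table
WHERE ifc_core_id_long >= 0 AND ifc_core_id_long < 1000000
GROUP BY ifc_core_id_long
"""
```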


## Query 2

```python
# Group by parentId (the skewed key), taking min of the update time and min of id.
df_merch = dd.read_sql_query(select([
    func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
    func.min(generated_table.c.ifc_core_id_long).label('ifc_core_id_long'),
    generated_table.c.ifc_core_parentid_long
]).group_by(generated_table.c.ifc_core_parentid_long), "trino://trino@trino.trino:8080/iceberg", index_col="ifc_core_parentid_long", npartitions=16)
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)
```

## Query 3

```python
# Group by id, taking min of the update time and min of parentId.
df_merch = dd.read_sql_query(select([
    func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
    func.min(generated_table.c.ifc_core_parentid_long).label('ifc_core_parentid_long'),
    generated_table.c.ifc_core_id_long
]).group_by(generated_table.c.ifc_core_id_long), "trino://trino@trino.trino:8080/iceberg", index_col="ifc_core_id_long", npartitions=16)
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)
```
  • I believe (py)iceberg now fully supports dask and you should be able to get optimisation on all kinds of queries from that https://py.iceberg.apache.org/ – mdurant Aug 17 '23 at 19:00
