we are trying to scale up Trino queries, and are currently failing. We use Trino to query Iceberg data, into Dask, in a jupyterlab notebook, and we're running on GKE Kubernetes We are using Dask to check Trino performance as using sql client apps returns the first X results and cancels the query. The dask configuration stayed the same throughout testing
We tried used two tables:
- a fake one has 1:1 ratio of parent ids to ids, and
- a real one with 5000 to 3 ids per parent id (to see how much data skew effects query time)
we are trying to query three columns: id, parentId and date. We are trying three types of queries: (1) grouping by id with min of parent id (2) same but w/o min of parent id; and (3) grouping by parentId with min of parent id
Queries are attached below
We used 4, 8, 16 Trino workers, and different iceberg partitioning strategies (by id and by parent id columns (not together)); in 4, 8 buckets
We see improvement in the fake table query by using more trino workers, but not from partitioning. We dont see improvement in the skewed (real) data.
We are trying to understand why
- partitioning does not help the query time,
- why querying only the id column on the skewed data does not improve by using more Trino workers
- is it correct that more trino workers will not help with highly skewed data that is not partitioned?
Any help would be appreciated
## Query 1
df_merch = dd.read_sql_query(select([
func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
generated_table.c.ifc_core_id_long
]).group_by(generated_table.c.ifc_core_id_long), "trino://trino@trino.trino:8080/iceberg", index_col="ifc_core_id_long", npartitions=16) # this will create the sql queries, but not run them.
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)
## Query 2
df_merch = dd.read_sql_query(select([
func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
func.min(generated_table.c.ifc_core_id_long).label('ifc_core_id_long'),
generated_table.c.ifc_core_parentid_long
]).group_by(generated_table.c.ifc_core_parentid_long), "trino://trino@trino.trino:8080/iceberg", index_col="ifc_core_parentid_long", npartitions=16) # this will create the sql queries, but not run them.
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)
## Query 3
df_merch = dd.read_sql_query(select([
func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
func.min(generated_table.c.ifc_core_parentid_long).label('ifc_core_parentid_long'),
generated_table.c.ifc_core_id_long
]).group_by(generated_table.c.ifc_core_id_long), "trino://trino@trino.trino:8080/iceberg", index_col="ifc_core_id_long", npartitions=16) # this will create the sql queries, but not run them.
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)