How to implement SQL Groupby in RAPIDS

Question

I'm seeking to translate an SQL query to use RAPIDS. Consider the simplified query below:

(SELECT min(a), max(b), c
FROM T
GROUP BY c) AS result

I have validated the code below, but is this the optimal solution? Is sorting on the group key necessary? Is there a cleaner / more idiomatic way to write it?

from pygdf import DataFrame as gdf

T = gdf(...)
df = gdf({'a':T.a, 'c':T.c}).groupby('c').min().sort_values(by='c')
df['max_b'] = gdf({'b':T.b, 'c':T.c}).groupby('c').max().sort_values(by='c').max_b
result = gdf({'a': df.min_a, 'b': df.max_b, 'c':df.c})

score 3 · Accepted Answer · answered Nov 23 '18 at 20:35

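You can rewrite your aggregation with the .agg method, which computes both aggregates in a single groupby. Since min(a) and max(b) then come back in the same rows, the sort_values calls that aligned the two separate results are no longer needed: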

from pygdf import DataFrame as gdf

T = gdf(...)
df = gdf({'a':T.a, 'b': T.b, 'c':T.c}).groupby('c').agg({'a': 'min', 'b': 'max'})
result = gdf({'a': df.min_a, 'b': df.max_b, 'c':df.c})
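For readers without a GPU at hand, the same single-pass pattern can be sketched on the CPU with pandas. This is only an illustration under assumptions: the sample rows are made up, and the named-aggregation syntax shown is pandas' own, so it may not be available in pygdf.

```python
import pandas as pd

# Hypothetical sample data standing in for table T.
T = pd.DataFrame({
    "a": [5, 3, 8, 1],
    "b": [10, 40, 20, 30],
    "c": ["x", "y", "x", "y"],
})

# One groupby computes both aggregates, so min(a) and max(b) for a given
# key land in the same row and nothing has to be sorted to line them up.
result = (
    T.groupby("c")
     .agg(min_a=("a", "min"), max_b=("b", "max"))
     .reset_index()  # turn the group key back into an ordinary column
)

print(result)
```

Each row of result holds one value of c together with its min(a) and max(b), which is the shape the SQL query returns.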

score 1 · Answer 2 · answered Jul 22 '19 at 19:17

You can use BlazingSQL, which is a SQL engine built on top of RAPIDS. Full disclosure: I work for BlazingSQL.

from blazingsql import BlazingContext
bc = BlazingContext()

# Create a table from an existing GPU DataFrame (gdf here is a DataFrame instance)
bc.create_table('myTableName', gdf)

# Query
result = bc.sql('SELECT min(a), max(b), c FROM main.myTableName GROUP BY c').get()
result_gdf = result.columns

# Print GDF
print(result_gdf)
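To sanity-check the output of any engine on a small sample, one option is to run the same query through Python's built-in sqlite3 module and compare. This is a rough sketch, not part of BlazingSQL: the sample rows are hypothetical, and the ORDER BY is added only to make the printed order deterministic.

```python
import sqlite3

# Hypothetical sample rows (a, b, c) standing in for table T.
rows = [(5, 10, "x"), (3, 40, "y"), (8, 20, "x"), (1, 30, "y")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a INTEGER, b INTEGER, c TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)", rows)

# Same query as in the question; ORDER BY c only fixes the row order.
reference = conn.execute(
    "SELECT min(a), max(b), c FROM T GROUP BY c ORDER BY c"
).fetchall()
conn.close()

print(reference)
```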
