I have a huge (20 GB) CSV file to process. The structure is as follows:
SID DATETIME LAT LON HEADING SPEED NAME
0 0ff58f68b3a1f 2023-06-06 16:47:38 43.589027 3.813129 297.0 7.0 AAAAAA
1 798cd24a678e1 2023-06-06 09:53:34 43.588792 3.812747 226.0 25.0 BBBBB
3 0ff58f68b3a1f 2023-06-06 16:47:32 43.589069 3.813142 129.0 4.0 AAAAAA
4 0ff58f68b3a1f 2023-06-06 16:47:33 43.589062 3.813133 217.0 5.0 AAAAAA
- I need to group by SID
- within each group, order by DATETIME
- apply a function that does several things (a simplified sketch is shown after this list):
  - calculate distances
  - calculate the duration
  - request additional data for the geo points (via REST requests)
- the resulting dataset must contain the SID and the aggregated information for each group (one line per group)
- write out the result dataset
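For context, getTripStatistics receives one group (one SID) as a pandas DataFrame and returns a single aggregated row. The real function is longer and also makes a REST request per point; the version below is only a simplified, hypothetical sketch (the haversine helper and the output column names are placeholders):

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between consecutive points, in kilometres
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def getTripStatistics(group):
    # simplified placeholder: distance and duration only, no REST enrichment
    g = group.sort_values('DATETIME')
    ts = pd.to_datetime(g['DATETIME'])
    dist_km = haversine_km(g['LAT'].shift(), g['LON'].shift(), g['LAT'], g['LON']).sum()
    duration_s = (ts.iloc[-1] - ts.iloc[0]).total_seconds()
    # ... the real function also requests additional data for each point via REST here ...
    return pd.Series({'distance_km': dist_km, 'duration_s': duration_s, 'n_points': len(g)})
```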
After reading a lot of articles and posts here, my code looks like this:
import dask.dataframe as dd
names = ['SID', 'DATETIME', 'LAT', 'LON', 'HEADING', 'SPEED', 'NAME']
ddf = dd.read_csv(FILE, header=0, names=names)  # FILE is the path to the 20 GB CSV
# sort by DATETIME (set_index sorts the data; reset_index keeps that order)
ddf1 = ddf.set_index('DATETIME').reset_index()
# Option 1: groupby + apply across the whole dataframe
ddf2 = ddf1.groupby('SID').apply(getTripStatistics, meta=(None, 'object'))
# Option 2: groupby + apply within each partition
ddf2 = ddf1.map_partitions(lambda df: df.groupby('SID').apply(getTripStatistics), meta=(None, 'object'))
ddf2.compute().to_csv("./tmp/my_one_file.csv", index=False)
And now my questions:
- the partitioning: does the number of partitions after read_csv depend on how big the dataset is, and will each partition fit into memory? (The snippet after this list shows how I inspect the partitioning.)
- after the groupby, I would expect each group (with the 20 GB dataset I have thousands of groups) to go into a separate partition. Is that correct? Should I call repartition manually after the groupby? Then I would use map_partitions to apply the function to each partition / SID group. I didn't find how to do this, i.e. "groupby, with each group going into its own partition".
- I see the same result from Option 1 and Option 2 in the code. Is that because I tested with a smaller dataset that fits into exactly one partition?
- the slowest point in my code is the getTripStatistics function. All the Dask examples I have seen use very simple aggregations; I found no example with a complex function that has multiple processing steps. Any tips on how to speed this up?
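For reference, this is how I look at the partitioning while experimenting; the blocksize value is only an example, and FILE and names are the same as in the code above:

```python
import dask.dataframe as dd

# blocksize controls how many partitions read_csv creates (64MB is just an example value)
ddf = dd.read_csv(FILE, header=0, names=names, blocksize='64MB')
print(ddf.npartitions)                    # number of partitions after read_csv
print(ddf.map_partitions(len).compute())  # rows per partition, to gauge memory per task
```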
Many thanks, I hope somebody can offer some advice.
regards
UPDATE: I tested with different numbers of partitions.

With 1 partition, groupby-apply and map_partitions return the same number of results at the end:

Length: 683, dtype: object    groupby apply lambda: 0:00:12.679400

With 4 partitions (after gr2 = gr2.repartition(npartitions=4)) I got a different number of results back:

# map_partitions   Length: 811, dtype: object    groupby apply lambda: 0:00:14.047789
# groupby-apply    Length: 683, dtype: object    groupby apply lambda: 0:00:13.530857

.. which would mean that groupby-apply uses only one partition of the dataset and ignores the rest? Is that correct, or did I misunderstand something?
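For completeness, a minimal sketch of the comparison behind the numbers above (ddf1, getTripStatistics and meta are the same as in the code earlier in the question; the variable names are placeholders):

```python
gr2 = ddf1.repartition(npartitions=4)

res_groupby = gr2.groupby('SID').apply(getTripStatistics, meta=(None, 'object')).compute()
res_mappart = gr2.map_partitions(
    lambda df: df.groupby('SID').apply(getTripStatistics), meta=(None, 'object')
).compute()

print(len(res_groupby), len(res_mappart))  # 683 vs. 811 in my test
```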