
I have a huge 20 GB CSV file to process. The structure is as follows:

   SID             DATETIME            LAT        LON         HEADING  SPEED NAME 
0  0ff58f68b3a1f  2023-06-06 16:47:38  43.589027  3.813129    297.0    7.0  AAAAAA
1  798cd24a678e1  2023-06-06 09:53:34  43.588792  3.812747    226.0   25.0  BBBBB 
3  0ff58f68b3a1f  2023-06-06 16:47:32  43.589069  3.813142    129.0    4.0  AAAAAA 
4  0ff58f68b3a1f  2023-06-06 16:47:33  43.589062  3.813133    217.0    5.0  AAAAAA
  1. I need to group by SID
  2. within each group, order by DATETIME
  3. apply a function that does several things (a simplified sketch of such a function follows this list):
    • calculate distances
    • calculate durations
    • request additional data for the geo points (via REST requests)
    The resulting dataset must contain SID and the aggregated information for each group (1 line per group).
  4. write out the result dataset
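
To make item 3 concrete: a simplified, hypothetical sketch of what such a per-group function could look like (names like trip_stats_sketch, DISTANCE_KM and DURATION_S are made up; the real getTripStatistics also calls a REST API per point, which is omitted here, and the sketch assumes DATETIME has already been parsed to datetime64):

import numpy as np
import pandas as pd

def trip_stats_sketch(group: pd.DataFrame) -> pd.Series:
    # order the points of this trip by time
    g = group.sort_values('DATETIME')
    # haversine distance between consecutive points, summed over the trip (km)
    lat = np.radians(g['LAT'].to_numpy())
    lon = np.radians(g['LON'].to_numpy())
    dlat = np.diff(lat)
    dlon = np.diff(lon)
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlon / 2) ** 2
    dist_km = float((2 * 6371.0 * np.arcsin(np.sqrt(a))).sum())
    # total duration of the trip (assumes DATETIME is already datetime64)
    duration_s = (g['DATETIME'].iloc[-1] - g['DATETIME'].iloc[0]).total_seconds()
    return pd.Series({'DISTANCE_KM': dist_km, 'DURATION_S': duration_s})

It is this kind of multi-step, per-group work that I need to run over the whole file.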

After reading a lot of articles and posts here, my code looks like this:

import dask.dataframe as dd

names = ['SID','DATETIME','LAT','LON','HEADING','SPEED','NAME']
ddf = dd.read_csv(FILE, header=0, names=names)

# sort by DATETIME (set_index shuffles and sorts, reset_index turns it back into a column)
ddf1 = ddf.set_index('DATETIME').reset_index()

# Option 1: groupby-apply on the Dask dataframe
ddf2 = ddf1.groupby('SID').apply(getTripStatistics, meta=(None, 'object'))

# Option 2: pandas groupby-apply inside each partition
ddf2 = ddf1.map_partitions(lambda df: df.groupby('SID').apply(getTripStatistics), meta=(None, 'object'))

ddf2.compute().to_csv("./tmp/my_one_file.csv", index=False)
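
As an aside, if I read the Dask docs correctly the result can also be written out by Dask itself rather than pulling everything through compute() on the client; something like this (not tested):

ddf2.to_csv("./tmp/my_one_file.csv", index=False, single_file=True)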

And now my questions:

  1. the partitioning: does the number of partitions after read_csv depend on how big the dataset is, and will each partition fit into memory? (see the partitioning sketch after this list)
  2. after the groupby, I would expect each group (with the 20 GB dataset I have thousands of groups) to go into a separate partition. Is that correct? Should I call repartition manually after the groupby? Then I would use map_partitions to apply the function to each per-SID partition. I did not find how to do this "groupby, then each group goes into a separate partition" step.
  3. I see the same result from Options 1 and 2 in the code. Is that because I tested with a smaller dataset that fit into exactly one partition?
  4. the slowest point in my code is the getTripStatistics function. All the Dask examples I have seen use very simple aggregations; I found no example with a complex function that has multiple processing steps. Any tips on how to speed this up?
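
For reference, the partition-control knobs I have found so far; I am not sure which of them actually matters here, and the values are only placeholders:

ddf = dd.read_csv(FILE, header=0, names=names, blocksize="64MB")  # controls partition size at read time
ddf = ddf.repartition(npartitions=32)                             # changes the number of partitions afterwards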

Many thanks, I hope somebody can give some advice.

regards

UPDATE: I tested with different numbers of partitions.

  • 1 partition: groupby-apply and map_partitions return the same number of results at the end

    Length: 683, dtype: object groupby apply lambda: 0:00:12.679400

  • 4 partitions (after gr2 = gr2.repartition(npartitions=4)): I got a different number of results back

    #map_partitions Length: 811, dtype: object groupby apply lambda: 0:00:14.047789

    #groupby-apply Length: 683, dtype: object groupby apply lambda: 0:00:13.530857

.. which means that groupby-apply uses only one partition of the dataset and ignores the rest? Is that correct? Or did I misunderstand?

1 Answer


map_partitions does the work per input partition, whatever size that may be. That means the number of outputs depends on how the groups are spread across the whole of your data, and is probably not what you want unless you have pre-partitioned your data somehow.
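
For example, one way to pre-partition so that all rows of a given SID end up in the same partition is to shuffle on SID first, e.g. via set_index (a sketch, not tested on your data; the shuffle is a full data movement, so it is not free on 20 GB):

ddf1 = ddf1.set_index('SID')  # full shuffle: rows with the same SID are now co-located in one partition
ddf2 = ddf1.map_partitions(
    lambda df: df.groupby('SID').apply(getTripStatistics),
    meta=(None, 'object'),
)

After that, every partition contains only whole groups, so map_partitions and the plain groupby-apply should agree on the number of results.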

mdurant
  • Hmm.. I am just doing groupby('SID'), where each group can contain a different number of rows.. why do I get a different number of results at the end? With 1 partition 683, with 4 partitions 811. – user2894156 Jul 14 '23 at 08:54
  • You are getting the total number of groups. You get a higher number where groups repeat between partitions. – mdurant Jul 14 '23 at 13:09