I have a CSV file of 550,000 rows of text. I read it into a pandas DataFrame, loop over the rows, and perform an operation on each one. Here is some sample code:

import pandas as pd

def my_operation(row_str):
    # perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')
results_list = []
for ii in range(df.shape[0]):
    my_new_str = my_operation(df.iloc[ii, 0])
    results_list.append(my_new_str)
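
For reference, I know the loop itself can be written more compactly with pandas' `Series.map`; it still calls `my_operation` one row at a time, so I don't expect a big speedup from this alone:

# equivalent to the loop above, applied over the first column
results_list = df.iloc[:, 0].map(my_operation).tolist()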

I started to implement `dask.delayed`, but after reading the Delayed Best Practices section of the docs, I am not sure I am using it in the most optimal way for this problem. Here is the same code with `dask.delayed`:

import pandas as pd
import dask

def my_operation(row_str):
    # perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')
results_list = []
for ii in range(df.shape[0]):
    my_new_str = dask.delayed(my_operation)(df.iloc[ii, 0])
    results_list.append(my_new_str)

results_list = dask.compute(*results_list)
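
Based on my reading of the best practices (in particular the advice to avoid creating too many fine-grained tasks, since 550,000 delayed calls means 550,000 tasks of scheduler overhead), I think batching rows into larger chunks is what the docs recommend. Here is a sketch of that idea; it is untested, and the chunk size of 10,000 is just a guess:

import pandas as pd
import dask

def my_operation_batch(row_strs):
    # apply my_operation to a whole chunk of rows inside a single task
    return [my_operation(s) for s in row_strs]

df = pd.read_csv('path/to/myfile.csv')
chunk_size = 10000  # guess; would need tuning
delayed_batches = []
for start in range(0, df.shape[0], chunk_size):
    chunk = df.iloc[start:start + chunk_size, 0].tolist()
    delayed_batches.append(dask.delayed(my_operation_batch)(chunk))

# each task returns a list, so flatten the per-chunk results back into one list
results_list = [s for batch in dask.compute(*delayed_batches) for s in batch]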

I'm running this on a single machine with 8 cores. Is there a more optimal way to load this large dataset and perform the same operation over each of the rows?
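
I have also wondered whether `dask.dataframe` would handle the loading better than a single `pd.read_csv`. This is a sketch of what I mean (untested; the `blocksize` value and the `meta` hint are guesses on my part):

import dask.dataframe as dd

# read the CSV in partitions rather than as one big pandas DataFrame
ddf = dd.read_csv('path/to/myfile.csv', blocksize='64MB')
first_col = ddf.columns[0]

# apply my_operation to each partition's column in parallel
result = ddf[first_col].map_partitions(lambda s: s.map(my_operation), meta=(first_col, 'object'))
results_list = result.compute().tolist()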

Thanks in advance for your help and let me know what else I can provide!

aclifton
  • There definitely is! You should totally look into Vaex, which is kind of the successor to dask, and its ability to convert CSV to HDF5 format, which can be partially loaded, whereas a CSV has to be loaded all at once. The workflow in Vaex is based on pandas, so you will feel right at home. https://github.com/vaexio/vaex – Jakob Guldberg Aaes Sep 17 '20 at 19:39
  • @JakobGuldbergAaes thank you for the comment. I will be sure to check out Vaex. For now, however, I am constrained to using `dask` for this project. – aclifton Sep 17 '20 at 20:45

0 Answers