I have a CSV file of 550,000 rows of text. I read it into a pandas dataframe, loop over the rows, and perform an operation on each one. Here is some sample code:
import pandas as pd

def my_operation(row_str):
    # perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')

results_list = []
for ii in range(df.shape[0]):
    my_new_str = my_operation(df.iloc[ii, 0])
    results_list.append(my_new_str)
I started to implement dask.delayed, but after reading the Delayed Best Practices section, I am not sure I am using dask.delayed in the best way for this problem. Here is the same code with dask.delayed:
import pandas as pd
import dask

def my_operation(row_str):
    # perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')

results_list = []
for ii in range(df.shape[0]):
    my_new_str = dask.delayed(my_operation)(df.iloc[ii, 0])
    results_list.append(my_new_str)

results_list = dask.compute(*results_list)
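From what I understood of the Delayed Best Practices page, creating 550,000 individual delayed calls may make the task graph too fine-grained, and the suggestion is to batch the work so each task processes a chunk of rows. This is only a rough sketch of what I think that would look like; the chunk size of 10000 is an arbitrary guess on my part:

import pandas as pd
import dask

def my_operation_on_chunk(row_strs):
    # apply my_operation to every string in one chunk of rows
    return [my_operation(row_str) for row_str in row_strs]

df = pd.read_csv('path/to/myfile.csv')
chunk_size = 10000  # arbitrary; I don't know what a good value would be

delayed_chunks = []
for start in range(0, df.shape[0], chunk_size):
    chunk = df.iloc[start:start + chunk_size, 0].tolist()
    delayed_chunks.append(dask.delayed(my_operation_on_chunk)(chunk))

# each task returns a list, so flatten the per-chunk results at the end
results_list = [r for chunk_result in dask.compute(*delayed_chunks) for r in chunk_result]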
I'm running this on a single machine with 8 cores, and I'd like to know whether there is a more efficient way to load this large dataset and perform the same operation on each of its rows.
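I also wondered whether something like dask.dataframe would be a better fit here, since it can read the CSV in partitions and parallelize the per-row work itself. This is just a rough sketch of what I mean, not something I have benchmarked; the blocksize value is an arbitrary guess, and I'm selecting the first column the same way as above:

import dask.dataframe as dd

# read the csv in ~64 MB partitions instead of one big pandas frame
ddf = dd.read_csv('path/to/myfile.csv', blocksize='64MB')

# apply my_operation to every value in the first column, one partition at a time
# (the meta name and dtype are assumptions about what my_operation returns)
results = ddf[ddf.columns[0]].map(my_operation, meta=('result', 'object'))

results_list = results.compute().tolist()

I'm not sure whether the default threaded scheduler would actually help for a pure-Python string function, or whether I would need the processes scheduler here.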
Thanks in advance for your help and let me know what else I can provide!