I have a dataframe that consists of 5 million records. I am trying to process it with the code below, leveraging Dask dataframes in Python:
import dask.dataframe as dd
dask_df = dd.read_csv(fullPath)
............
for index, row in uniqueURLs.iterrows():
    print(index)
    results = dask_df[dask_df['URL'] == row['URL']]
    count = results.size.compute()
I noticed that Dask seems very efficient at filtering dataframes, but not at .compute(). If I remove the line that computes the size of results, my program becomes very fast. Can someone explain this? How can I make it faster?
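To make the behaviour I am describing concrete, here is a minimal timing sketch (the file name and the example URL are placeholders, not my real data). The filter line returns almost immediately because it only builds a lazy task graph, and all of the actual work of reading the CSV happens inside .compute():

import time
import dask.dataframe as dd

# placeholder file and URL, just to illustrate where the time goes
dask_df = dd.read_csv("data.csv")

start = time.time()
results = dask_df[dask_df['URL'] == "http://example.com"]  # builds a lazy task graph only
print("filter took", time.time() - start, "seconds")       # essentially instant

start = time.time()
count = results.size.compute()                             # actually reads the CSV and counts
print("compute took", time.time() - start, "seconds")      # this is where all the time goes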