I am attempting to use Dask to work with a larger-than-memory dataset on my laptop, through a Jupyter notebook. The data is stored as many CSV files in an Amazon S3 bucket.
This first cell runs quickly and I can view the Dask dashboard on port 8787 as expected.
from dask.distributed import Client
import dask.dataframe as dd
client = Client()
client
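(For context, I'm just using the default LocalCluster that Client() creates. My understanding is that the cell above is roughly equivalent to the explicit version below; the worker count and memory limit are illustrative, not values I've tuned.)

from dask.distributed import Client, LocalCluster

# roughly what Client() gives me by default on this laptop (numbers are illustrative)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit='4GB')  # memory_limit is per worker
client = Client(cluster)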
This second cell executes in 55.1s, which seems odd to me since it isn't actually pulling any data.
df = dd.read_csv('s3://*/*/*.csv', assume_missing=True)
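I assume most of that time goes into listing every object that matches the glob and sampling bytes to infer dtypes, but I'm not sure. One variation I've been considering (the column names and dtypes below are placeholders for my actual schema) is to spell the dtypes out and set an explicit block size:

# placeholder schema -- the real files have more columns
dtypes = {'timestamp': 'float64', 'value': 'float64'}

df = dd.read_csv(
    's3://*/*/*.csv',
    dtype=dtypes,                      # avoid dtype inference from sampled data
    blocksize='64MB',                  # target partition size per task
    storage_options={'anon': False},   # use my AWS credentials from the environment
)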
This third cell hangs for 11 minutes before I see anything in the Task Stream in the dashboard, but then it works as expected, executing in 13m 3s total.
df['timestamp'] = dd.to_datetime(df['timestamp'], unit='ms')
df = df.set_index('timestamp')
df = client.persist(df)
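One thing I've wondered is whether the long quiet period is the shuffle that set_index triggers. If the files happen to already be in timestamp order (an assumption on my part, I haven't verified it), would something like this avoid most of that work?

from dask.distributed import progress

df['timestamp'] = dd.to_datetime(df['timestamp'], unit='ms')
# sorted=True tells Dask the data is already ordered by the new index,
# so it can skip the full shuffle -- only valid if that's actually true
df = df.set_index('timestamp', sorted=True)
df = client.persist(df)
progress(df)  # show a progress bar in the notebook while the persist runs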
This seems similar in spirit to "Dask Distributed client takes to long to initialize in jupyter lab", but my client starts fine, and everything does work eventually. Am I missing something obvious? Thanks!