I've just begun using dask, and I'm still fundamentally confused about how to do simple pandas tasks with multiple threads, or on a cluster.
Let's take pandas.merge() with dask dataframes.
import dask.dataframe as dd
df1 = dd.read_csv("file1.csv")
df2 = dd.read_csv("file2.csv")
df3 = dd.merge(df1, df2)
Now, let's say I were to run this on my laptop, with 4 cores. How do I assign 4 threads to this task?
It appears the correct way to do this is:
import dask
dask.set_options(get=dask.threaded.get)
df3 = dd.merge(df1, df2).compute()
And this will use as many threads as exist (i.e., as many cores with shared memory as my laptop has, here 4)? How do I set the number of threads explicitly?
Let's say I am at a facility with 100 cores. How do I submit this in the same manner as one would submit jobs to a cluster with qsub (similar to running tasks on clusters via MPI)?
import dask
dask.set_options(get=dask.threaded.get)
df3 = dd.merge(df1, df2).compute()