I am trying to understand the use patterns for Dask on a local machine.
Specifically,
- I have a dataset that fits in memory
- I'd like to do some pandas operations
- groupby...
- date parsing
- etc.
Pandas performs these operations on a single core, and they are taking hours for me. My machine has 8 cores, so I'd like to use Dask to parallelize these operations as much as possible.
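For concreteness, the kind of single-core pandas workload I mean looks roughly like this (the data and column names here are placeholders, not my real dataset):
import pandas as pd

# Placeholder data; my real dataset is much larger but still fits in memory.
df = pd.DataFrame({
    "key": list("ababab"),
    "timestamp": ["2020-01-01", "2020-01-02"] * 3,
    "value": range(6),
})
df["timestamp"] = pd.to_datetime(df["timestamp"])   # date parsing
result = df.groupby("key")["value"].mean()          # groupby aggregation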
My question is as follows: What is the difference between the following two ways of doing this in Dask?
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
(1)
import dask.dataframe as dd
df = dd.from_pandas(
    pd.DataFrame(iris.data, columns=iris.feature_names),
    npartitions=2,
)
df.mean().compute()
(2)
import dask.dataframe as dd
from distributed import Client
client = Client()  # starts a local scheduler and worker processes
df = client.persist(
    dd.from_pandas(
        pd.DataFrame(iris.data, columns=iris.feature_names),
        npartitions=2,
    )
)
df.mean().compute()
What is the benefit of one use pattern over the other, and in what situations should I prefer each?