
I created a Dask DataFrame from a Pandas DataFrame and applied a few functions to it. When I try to view the data using

 df.head()

it takes a very long time. How can I view the dataframe?

Hari

1 Answer


It really depends on what computations are behind your dataframe.

The df.head() command executes only those operations necessary to get a few rows of data from the dataframe. Often this is very fast. For example, if we are reading a large dataframe from a Parquet or CSV file, then we only need to load the first chunk of data to get the first few rows.

df = dd.read_csv('...')
df.head()  # this is relatively fast

However, if our dataframe is more complex, for example the result of a lazy shuffle or set_index operation, then we might genuinely need to read and process all of our data before we can get the first few rows.

df = df.set_index('some-column')
df = df.merge(some_other_df)
df.head()  # this is slow, because it has to do the set_index and merge

You can always see metadata cheaply (column names, types, number of tasks and partitions).

>>> df
Dask DataFrame Structure:
                       close     high      low     open
npartitions=505                                        
2008-01-02 09:00:00  float64  float64  float64  float64
2008-01-03 09:00:00      ...      ...      ...      ...
...                      ...      ...      ...      ...
2009-12-31 09:00:00      ...      ...      ...      ...
2009-12-31 16:00:00      ...      ...      ...      ...
Dask Name: from-delayed, 1010 tasks
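
As a small sketch of that kind of cheap inspection (these are standard Dask DataFrame attributes, and df here is just the dataframe from above), none of the following triggers any computation:

df.columns      # column names, metadata only
df.dtypes       # column types, metadata only
df.npartitions  # number of partitions, metadata only
df.divisions    # known partition boundaries (may contain None)

Because these only touch the task-graph metadata and never the underlying data, they return instantly even for very large dataframes.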

Persist

If your data fits in RAM (or distributed RAM if you're on a cluster), then you should also persist it to memory. This will make subsequent operations very fast.

df = df.persist()

However, if you don't have enough RAM, then this may slow down your machine.
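
One rough way to gauge this (a sketch, assuming a pandas-style memory estimate is good enough for your case) is to compute the total memory usage before persisting:

# estimated in-memory size in bytes; note this does read the data once
nbytes = df.memory_usage(deep=True).sum().compute()
print(nbytes / 1e9, "GB")
df = df.persist()  # only if the estimate comfortably fits in available RAM

Note that the estimate itself requires a full pass over the data, so it is mainly useful when the data source is cheap to re-read.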

MRocklin
  • Here is my code (the data is loaded into pandas from SQL Server):

    dfd = dd.from_pandas(df, npartitions=100)
    names = dfd['name'].drop_duplicates().compute()
    dfd['found'] = dfd['newcol'].apply(lambda x: any(name in x for name in names), meta=('x', 'str')).astype(int)

    When I display dfd, the Dask dataframe structure shows npartitions=1, and I'm trying to understand that. – Hari Mar 23 '17 at 15:25