
I created a Dask DataFrame from a Pandas DataFrame and applied a few functions to it. When I try to view the data using

 df.head()

it takes a very long time. How can I view the dataframe?

Hari

1 Answer


It really depends on what computations are behind your dataframe.

The df.head() command executes only those operations necessary to get a few rows of data from the dataframe. Often this is very fast. For example, if we are reading a large dataframe from a Parquet or CSV file, then we only need to load the first chunk of data to get the first few rows.

df = dd.read_csv('...')
df.head()  # this is relatively fast

However, if our dataframe is more complex, for example the result of a lazy shuffle or set_index operation, then we might genuinely need to read and process all of our data before we can get the first few rows.

df = df.set_index('some-column')
df = df.merge(some_other_df)
df.head()  # this is slow, because it has to do the set_index and merge

You can always see metadata cheaply (column names, types, number of tasks and partitions).

>>> df
Dask DataFrame Structure:
                       close     high      low     open
npartitions=505                                        
2008-01-02 09:00:00  float64  float64  float64  float64
2008-01-03 09:00:00      ...      ...      ...      ...
...                      ...      ...      ...      ...
2009-12-31 09:00:00      ...      ...      ...      ...
2009-12-31 16:00:00      ...      ...      ...      ...
Dask Name: from-delayed, 1010 tasks
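
As a small sketch of that kind of cheap inspection (these are standard Dask DataFrame attributes, and df here is just the dataframe from above), none of the following triggers any computation:

df.columns      # column names, metadata only
df.dtypes       # column types, metadata only
df.npartitions  # number of partitions, metadata only
df.divisions    # known partition boundaries (may contain None)

Because these only touch the task-graph metadata and never the underlying data, they return instantly even for very large dataframes.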

Persist

If your data fits in RAM (or distributed RAM if you're on a cluster), then you should also persist it to memory. This will make subsequent operations very fast.

df = df.persist()

However, if you don't have enough RAM, then this may slow down your machine.
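
One rough way to gauge this (a sketch, assuming a pandas-style memory estimate is good enough for your case) is to compute the total memory usage before persisting:

# estimated in-memory size in bytes; note this does read the data once
nbytes = df.memory_usage(deep=True).sum().compute()
print(nbytes / 1e9, "GB")
df = df.persist()  # only if the estimate comfortably fits in available RAM

Note that the estimate itself requires a full pass over the data, so it is mainly useful when the data source is cheap to re-read.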

MRocklin
  • Here is my code (the data is loaded into pandas from SQL Server):

    dfd = dd.from_pandas(df, npartitions=100)
    names = dfd['name'].drop_duplicates().compute()
    dfd['found'] = dfd['newcol'].apply(lambda x: any(name in x for name in names), meta=('x', 'str')).astype(int)

    When I display dfd, the Dask dataframe structure shows npartitions=1, and I'm trying to understand that. – Hari Mar 23 '17 at 15:25