
I'm new to Dask and thought this would be a simple task. I want to load data from multiple csv files and combine it into one Dask dataframe. In this example, there are 5 csv files with 10,000 rows of data in each. Obviously I want to give the combined dataframe a unique index.

So I did this:

import os
import dask.dataframe as dd

# Define Dask computations
dataframes = [
    dd.read_csv(os.path.join(data_dir, filename)).set_index('Unnamed: 0')
    for filename in os.listdir(data_dir) if filename.endswith('.csv')
]

combined_df = dd.concat(dataframes).reset_index(drop=True)

If I then do combined_df.head().index I get this as expected:

RangeIndex(start=0, stop=5, step=1)

But combined_df.tail().index is not as expected:

RangeIndex(start=3252, stop=3257, step=1)

Further inspection reveals that the index values on combined_df consist of 15 separate series of roughly 3256 values each, adding up to a total length of 50000. Note that the csv files all contain an index in the first column from 0 to 10000.
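
For reference, here is a quick check (my own addition, not part of the original code) that shows how the rows are spread across the partitions of combined_df:

# Diagnostic only: partition count and per-partition row counts
print(combined_df.npartitions)
print(combined_df.map_partitions(len).compute())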

What is going on here and how do I get a standard integer index from 0 to 50000, which is the combined total number of rows in all the csv files?

Background

If you need to test the code above, here is a setup script to create some csv files:

import os
import numpy as np
import pandas as pd

# Create 5 large csv files (could be too big to fit all in memory)
shape = (10000, 1000)

data_dir = 'data'
if not os.path.exists(data_dir):
    os.mkdir(data_dir)

for i in range(5):
    filepath = os.path.join(data_dir, f'datafile_{i:02d}.csv')
    if not os.path.exists(filepath):
        data = (i + 1) * np.random.randn(shape[0], shape[1])
        print(f"Array {i} size in memory: {data.nbytes*1e-6:.2f} MB")
        pd.DataFrame(data).to_csv(filepath)

UPDATE:

The same problem seems to occur with this method:

combined_df = dd.read_csv(os.path.join(data_dir, '*.csv'))
print(dd.compute(combined_df.tail().index)[0])
print(dd.compute(combined_df.reset_index(drop=True).tail().index)[0])

RangeIndex(start=3252, stop=3257, step=1)
RangeIndex(start=3252, stop=3257, step=1)

It seems to me that the reset_index method produces the same index.
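
A possibly related detail (again, my own check rather than part of the original code): the dataframe returned by read_csv has unknown divisions, so presumably Dask has no way to know each partition's global starting position:

# Assuming combined_df = dd.read_csv(...) as above
print(combined_df.known_divisions)   # False
print(combined_df.divisions)         # (None, None, ..., None)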

Bill
  • Ah, I see now in the [documentation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=reset_index#dask.dataframe.DataFrame.reset_index) it says "Note that unlike in pandas, the reset `dask.dataframe` index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. `index1 = [0, ..., 10], index2 = [0, ...]`). This is due to the inability to statically know the full length of the index." – Bill Apr 24 '20 at 17:19

1 Answer


In the Dask version, reset_index performs its task separately (and concurrently) on each partition, so the consecutive numbers in the index "restart" at certain points, namely at the start of each partition.

To circumvent this limitation, you can:

  • Assign a new column filled with 1.
  • Set the index to cumsum() - 1 computed on this column (fortunately, unlike reset_index, cumsum is computed over the whole DataFrame).

A side effect is that the name of the index is now the name of this new column. If you want to clear it, you have to do it at the partition level, calling map_partitions.

So the whole code can be:

ddf = ddf.assign(idx=1)                    # add a column filled with 1
ddf = ddf.set_index(ddf.idx.cumsum() - 1)  # running count starting at 0 becomes the index
ddf = ddf.map_partitions(lambda df: df.rename_axis(None))  # clear the index name

Note that assign(idx=1) is OK: this apparently single value is broadcast to the length of the whole DataFrame, so each element in the new column is set to 1 without needing to know how many rows the DataFrame contains. This is one of the magnificent broadcasting features of the underlying NumPy package, which substantially simplifies programming in NumPy, Pandas, and Dask alike.
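
To illustrate the broadcasting, here is a minimal pandas-only sketch (for illustration only, not part of the workaround itself):

import pandas as pd

# The scalar 1 is broadcast to every row, so the length is never needed up front
df = pd.DataFrame({'a': [10, 20, 30]})
df = df.assign(idx=1)                   # idx column becomes [1, 1, 1]
print((df.idx.cumsum() - 1).tolist())   # [0, 1, 2]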

Then you can run ddf.compute() to see the result.
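
As a quick sanity check (my own suggestion, assuming the 5 files of 10000 rows each from the setup script), the head and tail indices should now span the full range from 0 to 49999:

# Assuming ddf was built from the 5 csv files (50000 rows in total)
print(ddf.head().index)   # first labels start at 0
print(ddf.tail().index)   # last labels end at 49999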

Valdi_Bo
  • Interesting. Thanks for the work-around but surely we would like to have a `reset_index` method that works like it does on Pandas? – Bill Apr 24 '20 at 17:18
  • What about first converting from *dask* DataFrame to the *Pandas* version and **then** calling *reset_index()*? This way, the "pandasonic" version will operate on the whole DataFrame, not on individual partitions (as it does in *dask*). – Valdi_Bo Apr 24 '20 at 17:21
  • My dataframe won't fit in memory. That is why I am using Dask. – Bill Apr 24 '20 at 17:23
  • What is the reason why `ddf = ddf.set_index(np.arange(len(ddf)))` does not work? This is what I would normally do in Pandas to replace the index. (It raises `KeyError`). – Bill Apr 24 '20 at 17:26
  • In the Pandas [documentation for `set_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html) it says you can use "one or more existing columns or arrays (of the correct length)" whereas in the Dask documentation it says only "using an existing column". So I guess in general Dask has a subset of functionality compared to Pandas. – Bill Apr 24 '20 at 17:32
  • I'm afraid that *set_index* is also decomposed into operations on particular partitions. Maybe *dask* attempts to set this index in each partition? And why do you attempt to find another solution when you already have one (mine) that works? – Valdi_Bo Apr 24 '20 at 17:33
  • Try to assign a new column (using *np.arange*), then you will have an **existing column**. Then call *set_index* on this column. But then this solution becomes quite similar to mine. The only difference is that you generate this new column in one go, whereas I set a column of ones and then called *cumsum()*. – Valdi_Bo Apr 24 '20 at 17:36
  • Yeah I'm just trying that. Can't seem to get assign working with anything other than a scalar as in your example. To prepare an array you need to know the length so I think your method may be the best. Thanks for your help! – Bill Apr 24 '20 at 17:45