10

When using pandas to read a CSV, it is possible to specify the index column. Is this possible with Dask when reading the file, as opposed to setting the index afterwards?

For example, using pandas:

df = pandas.read_csv(filename, index_col=0)

Ideally using dask could this be:

df = dask.dataframe.read_csv(filename, index_col=0)

I have tried

df = dask.dataframe.read_csv(filename).set_index(?)

but the index column does not have a name (and this seems slow).

Jaydog
    the documentation seems to indicate that `df = dask.dataframe.read_csv(filename, index_col=0)` should work as the `kwargs` are passed to `pandas`, did you try this? – EdChum Sep 12 '17 at 10:57
    I did try and it failed with the error highlighted by MRocklin below, i.e. `ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead` – Jaydog Sep 12 '17 at 12:50

3 Answers

7

No, these need to be two separate steps. If you try it, Dask will tell you so in a nice error message:

In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead

But this won't be any slower or faster than doing it the other way.
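In practice, the pattern the error message points to looks like this (a minimal sketch; the glob and the column name `'my-index'` are placeholders taken from the error message):

import dask.dataframe as dd

# Read all matching CSVs lazily, then promote a column to the index,
# exactly as the error message suggests.
df = dd.read_csv('*.csv').set_index('my-index')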

MRocklin
  • Won't this be much slower on partitioned data? If each CSV is a partition on the index, nothing would need to run; however, `set_index` would force a full pass over the entire dataset just to build the index's metadata. – Stijn Tallon Mar 09 '20 at 09:38
  • If you know the partitions ahead of time then you can specify them in the `set_index` call. If you don't know them then we'll have to read the files anyway. If they are already sorted then Dask will do this in a single read. – MRocklin Mar 20 '20 at 00:28
  • @MRocklin Is there an option for continuous time series in CSV files to let `dask` know before the read that these are already sorted timestamps? That would allow dask to get the first and last row of each CSV, knowing the time ranges of the partitions before their creation. Would that help prevent shuffling during the `set_index` call? – gies0r Jun 12 '20 at 11:11
  • [Ok just reading through..](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.set_index), the answer is `ddf.set_index('col', sorted=True)`. Nice. – gies0r Jun 12 '20 at 11:17
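A minimal sketch of the two options discussed in these comments, assuming time-series CSVs whose rows are already sorted by a 'timestamp' column (the file pattern and column name are placeholders):

import dask.dataframe as dd

df = dd.read_csv('data-*.csv', parse_dates=['timestamp'])

# If the rows are already globally sorted by 'timestamp', sorted=True
# avoids the full shuffle that set_index would otherwise trigger.
df = df.set_index('timestamp', sorted=True)

# Alternatively, if the partition boundaries are known ahead of time,
# pass them explicitly:
# df = df.set_index('timestamp', divisions=known_boundaries)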
2

I know I'm a bit late, but this is the first result on Google so it should get answered.

If you write your dataframe with:

# index=True is the default
my_pandas_df.to_csv('path')

# so this is the same
my_pandas_df.to_csv('path', index=True)

And read it back with Dask:

import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')

It will use column 0 as your index (which is unnamed thanks to `pandas.DataFrame.to_csv()`).

How to figure it out:

my_dask_df = dd.read_csv('path')
my_dask_df.columns

which returns

Index(['Unnamed: 0', 'col 0', 'col 1',
       ...
       'col n'],
      dtype='object', length=...)
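If you control the writing side, you can avoid 'Unnamed: 0' altogether by naming the index on export. A minimal sketch using `index_label`, a standard `pandas.DataFrame.to_csv()` parameter (the label 'idx' is just an example):

import pandas as pd
import dask.dataframe as dd

my_pandas_df = pd.DataFrame({'col 0': [1, 2, 3]})

# Name the index column when writing...
my_pandas_df.to_csv('path', index_label='idx')

# ...so the read side can refer to it by name.
my_dask_df = dd.read_csv('path').set_index('idx')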
E. Bassett
2

Now you can write: `df = pandas.read_csv(filename, index_col='column_name')` (where `column_name` is the name of the column you want to set as the index).

Alain Bianchini
Sunil