10

When using pandas to read a CSV, it is possible to specify the index column. Is this possible with Dask when reading the file, as opposed to setting the index afterwards?

For example, using pandas:

df = pandas.read_csv(filename, index_col=0)

Ideally using dask could this be:

df = dask.dataframe.read_csv(filename, index_col=0)

I have tried

df = dask.dataframe.read_csv(filename).set_index(?)

but the index column does not have a name (and this seems slow).

Jaydog
    the documentation seems to indicate that `df = dask.dataframe.read_csv(filename, index_col=0)` should work as the `kwargs` are passed to `pandas`, did you try this? – EdChum Sep 12 '17 at 10:57
    I did try and it failed with the error highlighted by MRocklin below, i.e. `ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead` – Jaydog Sep 12 '17 at 12:50

3 Answers

7

No, these need to be two separate steps. If you try it, Dask will tell you so in a nice error message:

In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead

But this won't be any slower or faster than doing it the other way.
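In practice, the pattern the error message points to looks like this (a minimal sketch; the glob and the column name `'my-index'` are placeholders taken from the error message):

import dask.dataframe as dd

# Read all matching CSVs lazily, then promote a column to the index,
# exactly as the error message suggests.
df = dd.read_csv('*.csv').set_index('my-index')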

MRocklin
  • Won't this be much slower on partitioned data? If each CSV is a partition on the index, nothing would need to run; however, `set_index` would force a full pass over the entire dataset just to build the index's metadata. – Stijn Tallon Mar 09 '20 at 09:38
  • If you know the partitions ahead of time then you can specify them in the `set_index` call. If you don't know them then we'll have to read the files anyway. If they are already sorted then Dask will do this in a single read. – MRocklin Mar 20 '20 at 00:28
  • @MRocklin Is there an option for continuous time series in CSV files to let `dask` know before the read that these are already sorted timestamps? That would allow dask to get the first and last row of each CSV, knowing the time ranges of the partitions before their creation. Would that help prevent shuffling during the `set_index` call? – gies0r Jun 12 '20 at 11:11
  • [Ok just reading through..](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.set_index), the answer is `ddf.set_index('col', sorted=True)`. Nice. – gies0r Jun 12 '20 at 11:17
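A minimal sketch of the two options discussed in these comments, assuming time-series CSVs whose rows are already sorted by a 'timestamp' column (the file pattern and column name are placeholders):

import dask.dataframe as dd

df = dd.read_csv('data-*.csv', parse_dates=['timestamp'])

# If the rows are already globally sorted by 'timestamp', sorted=True
# avoids the full shuffle that set_index would otherwise trigger.
df = df.set_index('timestamp', sorted=True)

# Alternatively, if the partition boundaries are known ahead of time,
# pass them explicitly:
# df = df.set_index('timestamp', divisions=known_boundaries)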
2

I know I'm a bit late, but this is the first result on Google so it should get answered.

If you write your dataframe with:

# index=True is the default
my_pandas_df.to_csv('path')

# so this is the same
my_pandas_df.to_csv('path', index=True)

And read it back with Dask:

import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')

It will use column 0 as your index (which is unnamed thanks to `pandas.DataFrame.to_csv()`).

How to figure it out:

my_dask_df = dd.read_csv('path')
my_dask_df.columns

which returns

Index(['Unnamed: 0', 'col 0', 'col 1',
       ...
       'col n'],
      dtype='object', length=...)
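If you control the writing side, you can avoid 'Unnamed: 0' altogether by naming the index on export. A minimal sketch using `index_label`, a standard `pandas.DataFrame.to_csv()` parameter (the label 'idx' is just an example):

import pandas as pd
import dask.dataframe as dd

my_pandas_df = pd.DataFrame({'col 0': [1, 2, 3]})

# Name the index column when writing...
my_pandas_df.to_csv('path', index_label='idx')

# ...so the read side can refer to it by name.
my_dask_df = dd.read_csv('path').set_index('idx')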
E. Bassett
2

Now you can write: `df = pandas.read_csv(filename, index_col='column_name')` (where `column_name` is the name of the column you want to set as the index).

Alain Bianchini
Sunil