What is the way to add an index column in Dask when reading from a CSV?

Question

I'm trying to process a fairly large dataset that doesn't fit into memory when loaded all at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset after reading it with the read_csv method. I keep getting an error (see Code). I want to create an index column so I can set that new column as the index for the data, but the error appears to be telling me to set the index first before creating the column.

CODE

import numpy as np
import dask.dataframe as dd

df = dd.read_csv(r'path\to\file\file.csv')  # File does not have a unique ID column, so I have to create one.
df['index_col'] = dd.from_array(np.arange(len(df)))  # Trying to add an index column and fill it
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

Update

Using range(1, len(df) + 1) changed the error to: TypeError: Column assignment doesn't support type range

score 6 · Accepted Answer · answered Oct 24 '19 at 00:51

Right, it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... if the dataset spans multiple partitions.
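To make that concrete, here is a small plain-Python sketch (no Dask involved) of what a global row number needs. The partition lengths are made-up values for illustration; with a real CSV they are unknown until every earlier chunk has been read.

```python
from itertools import accumulate

# Hypothetical row counts per partition (assumed for illustration only).
partition_lengths = [4, 3, 3]

# Each partition's first row number is the total row count of all
# partitions before it, so it depends on every earlier partition.
offsets = [0] + list(accumulate(partition_lengths))[:-1]

# Global row numbers, grouped by partition.
global_ids = [
    list(range(start, start + n))
    for start, n in zip(offsets, partition_lengths)
]
print(offsets)
print(global_ids)
```

The point is only that no partition can number its rows until the lengths of all earlier partitions are known.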

One approach would be to create a column of ones:

df["idx"] = 1

and then call cumsum

df["idx"] = df["idx"].cumsum()

But note that this does add a bunch of dependencies to the task graph that backs your dataframe, so some operations might not be as parallel as they were before.

answered Oct 24 '19 at 00:51

MRocklin

I'll try it out and let you know how it goes! Is this generally the way to go when the dataset spans multiple partitions and you're loading the data from a csv? – ShockDoctor Oct 24 '19 at 13:45
Typically people just stop expecting a consistent incrementing index. – MRocklin Oct 25 '19 at 14:21
