I'm new to Dask and thought this would be a simple task. I want to load data from multiple csv files and combine it into one Dask dataframe. In this example there are 5 csv files with 10,000 rows of data each. Obviously I want to give the combined dataframe a unique index.
So I did this:
import os
import dask.dataframe as dd

# Define Dask computations
dataframes = [
    dd.read_csv(os.path.join(data_dir, filename)).set_index('Unnamed: 0')
    for filename in os.listdir(data_dir) if filename.endswith('.csv')
]
combined_df = dd.concat(dataframes).reset_index(drop=True)
If I then do combined_df.head().index
I get this as expected:
RangeIndex(start=0, stop=5, step=1)
But combined_df.tail().index
is not as expected:
RangeIndex(start=3252, stop=3257, step=1)
Further inspection reveals that the index values on combined_df consist of 15 separate sequences, each roughly 3256 values long, adding up to a total length of 50000. Note that each csv file contains an index in its first column running from 0 to 9999.
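For reference, this is roughly how I inspected the partitioning (npartitions and map_partitions are standard Dask attributes/methods; the counts in the comments are just what I observed):
print(combined_df.npartitions)                    # 15 partitions
print(combined_df.map_partitions(len).compute())  # rows per partition, roughly 3256 each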
What is going on here, and how do I get a standard integer index from 0 to 49999 covering the combined total of 50000 rows across all the csv files?
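Concretely, what I would like is for the tail of the combined dataframe to report something like this (assuming 50000 rows in total):
combined_df.tail().index
# desired: RangeIndex(start=49995, stop=50000, step=1)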
Background
If you need to test the code above, here is a setup script to create some csv files:
import os
import numpy as np
import pandas as pd
# Create 5 large csv files (could be too big to fit all in memory)
shape = (10000, 1000)
data_dir = 'data'
if not os.path.exists(data_dir):
    os.mkdir(data_dir)

for i in range(5):
    filepath = os.path.join(data_dir, f'datafile_{i:02d}.csv')
    if not os.path.exists(filepath):
        data = (i + 1) * np.random.randn(shape[0], shape[1])
        print(f"Array {i} size in memory: {data.nbytes*1e-6:.2f} MB")
        pd.DataFrame(data).to_csv(filepath)
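For what it's worth, to_csv writes the pandas index as an unnamed first column, which is why I call set_index('Unnamed: 0') when reading the files back in. A quick check with pandas (nrows is a standard read_csv parameter):
peek = pd.read_csv(os.path.join(data_dir, 'datafile_00.csv'), nrows=3)
print(peek.columns[:3])  # expect something like Index(['Unnamed: 0', '0', '1'], dtype='object')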
UPDATE:
The same problem seems to occur with this method:
combined_df = dd.read_csv(os.path.join(data_dir, '*.csv'))
print(dd.compute(combined_df.tail().index)[0])
print(dd.compute(combined_df.reset_index(drop=True).tail().index)[0])
RangeIndex(start=3252, stop=3257, step=1)
RangeIndex(start=3252, stop=3257, step=1)
It seems to me that the reset_index method produces exactly the same index either way.
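From what I can tell, Dask applies reset_index to each partition independently, so the index restarts at 0 in every partition. The closest I have got to a globally unique 0-to-49999 index is the cumulative-sum trick below, but I don't know whether it is the recommended approach (idx is just a throwaway column name I chose):
# Build an explicit 0..N-1 index by cumulative-summing a column of ones
combined_df = combined_df.assign(idx=1)
combined_df['idx'] = combined_df['idx'].cumsum() - 1
combined_df = combined_df.set_index('idx', sorted=True)
print(combined_df.tail().index)  # should now end at 49999 instead of restarting per partition
Is there a more idiomatic way to achieve this?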