I found that Dask can read several CSV files at once using a glob pattern:

import dask.dataframe as dd
df = dd.read_csv('myfiles.*.csv')

But what if I want to load only some of them, given as an explicit list?

my_files = ['file1.csv', 'file3.csv', 'file7.csv']
df = dd.read_csv(my_files)

But that fails with:

ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements

Mikhail_Sam

1 Answer

The error occurred because some of my CSV files had a different number of columns. With files that share the same columns, reading a list of files into one dataframe works like this:

To get a Dask DataFrame:

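import dask.dataframe as dd

# Pass the file names as a list; the result is a lazy Dask DataFrame.
df = dd.read_csv(["small1.csv", "small2.csv"])
print(df.shape)  # the row count stays lazy (Delayed) until computed
print(type(df))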

Output:

(Delayed('int-863f32f2-a8c3-4ac9-b31f-0186541c347c'), 3) 
<class 'dask.dataframe.core.DataFrame'>

To get a pandas DataFrame:

df = dd.read_csv(["small1.csv", "small2.csv"])
# compute() reads the files and returns an in-memory pandas DataFrame.
df = df.compute()
print(df.shape)
print(type(df))

Output:

(11000, 3)
<class 'pandas.core.frame.DataFrame'>
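
Since the original error came from files with differing column counts, it may help to compare the headers before calling dd.read_csv. The following is only a minimal sketch using the standard library: the helper names group_files_by_header and check_same_columns are hypothetical (they are not part of Dask), and it assumes each file is comma-delimited with a header on its first line.

```python
import csv
from collections import defaultdict


def group_files_by_header(paths):
    """Map each distinct header (a tuple of column names) to the files that use it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, newline="") as handle:
            # Read only the first row; an empty file yields an empty header.
            header = tuple(next(csv.reader(handle), []))
        groups[header].append(path)
    return dict(groups)


def check_same_columns(paths):
    """Raise ValueError if the files do not all share one header (hypothetical helper)."""
    groups = group_files_by_header(paths)
    if len(groups) > 1:
        details = "; ".join(
            f"{len(header)} columns in {files}" for header, files in groups.items()
        )
        raise ValueError(f"CSV headers differ: {details}")
```

Calling check_same_columns(my_files) before dd.read_csv(my_files) should point at the odd files directly, which is easier to act on than the "Length mismatch" message.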
Mikhail_Sam