
After processing a big data set using Pandas/Dask, I saved the resulting data frame to a csv file.

When I try to read the output CSV using Dask, the data types are all objects by default. Whenever I try to convert them using conventional methods (e.g. defining data types while reading or reassigning them after reading) I keep getting errors regarding the conversion as seen below:

# ATTEMPT 1

import dask.dataframe as dd
header = ['colA', 'colB', ...]
dtypes = {'colA' : 'float', ...}
df = dd.read_csv('file.csv', names=header, dtype=dtypes)

> TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'
> ...
> ValueError: could not convert string to float: 'colA'

-----------------------------------------------------------------------------------

# ATTEMPT 2

import dask.dataframe as dd
header = ['colA', 'colB', ...]
df = dd.read_csv('file.csv', names=header)
df['colA'] = df['colA'].astype(str).astype(float)

> ...
> File "/home/routar/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 730, in astype_nansafe
> ValueError: could not convert string to float: 'colA'

All the attributes in the original data frame (before converting to CSV) are ints/floats so the conversion is 100% possible. I'm also sure the values are valid.

I'm guessing this has something to do with Python's safe policy regarding data conversions.

Is there a workaround for this or any way to force the conversion?

mdurant
GRoutar

1 Answer

When you read a CSV with names=header and no header argument, the file's own header row is treated as data, so the column names end up as the first row of your dataframe.

That is why you get the error

ValueError: could not convert string to float: 'colA'

The string 'colA' is the first value in the column colA, and it cannot be parsed as a float.
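A minimal sketch of the symptom, using plain pandas instead of Dask so it runs without a cluster (dd.read_csv forwards these keyword arguments to pandas.read_csv). The CSV contents here are made up for illustration and stand in for file.csv:

```python
import io
import pandas as pd

# Hypothetical stand-in for file.csv: one header row, then numeric rows.
csv_text = "colA,colB\n1.5,2\n3.5,4\n"
header = ["colA", "colB"]

# With names= and no header argument, the header row is parsed as data.
df = pd.read_csv(io.StringIO(csv_text), names=header)

print(len(df))               # one row more than the file has data rows
print(df.iloc[0].tolist())   # the first row holds the column names
print(df.dtypes)             # columns are not numeric, because of that row
```

The stray header row is what forces every column to a non-numeric type, and it is also the value that the later float conversion chokes on.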

To fix it, pass header=0 to read_csv. This tells the parser that the first row is a header, which the names you supply then replace:

df = dd.read_csv('file.csv', names=header, dtype=dtypes, header=0)
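To check the effect of header=0 on a small case, here is a sketch with plain pandas (Dask's read_csv passes the same keyword arguments through to pandas.read_csv). The in-memory CSV and the dtype for colB are assumptions made for the example, not taken from the original data:

```python
import io
import pandas as pd

# Hypothetical stand-in for file.csv: one header row, then numeric rows.
csv_text = "colA,colB\n1.5,2\n3.5,4\n"
header = ["colA", "colB"]
dtypes = {"colA": "float", "colB": "int"}  # colB's dtype is assumed here

# header=0 consumes the first row as the header; names= then overrides it,
# so only the numeric rows remain and the requested dtypes can be applied.
df = pd.read_csv(io.StringIO(csv_text), names=header, dtype=dtypes, header=0)

print(df.dtypes)
print(df)
```

If the file had no header row at all, header=0 would instead discard the first data row, so it is worth confirming the row count on a sample before applying this to the full data set.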
Teoretic
  • Lol the error message led me to think the issue lay in the column 'colA' instead of the actual value 'colA'. The types are now ok. – GRoutar Sep 20 '18 at 16:11