
I have a large CSV file with 9600 columns, and each column has a different type. When I read the file with a Dask DataFrame and call head(), I get the error "Mismatched dtypes found in pd.read_csv/pd.read_table". How can I ignore this error? Reading the file with pandas works without errors, but it is very slow because the file is 2.5 GB. Thanks!

  • That's a lot of columns! How many rows? – mdurant Nov 21 '18 at 14:50
  • Could you provide a sample of your csv file, along with the code you use to import it? – leoburgy Nov 21 '18 at 14:58
  • Depending on your degree of knowledge about the data types present in the data set, you may want to cast them to string by setting the `dtype` argument of the `read_csv()` function (see the first sketch after these comments). – leoburgy Nov 21 '18 at 15:03
  • In a footnote of the `Dask` docs (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv), you can read that despite the data-type inference, the presence of NaN can confuse the csv reader function. – leoburgy Nov 21 '18 at 15:06
  • @mdurant It has 129013 rows – Hoàng Quốc Cường Nov 21 '18 at 15:13
  • I had a similar issue and forced all columns initially to be of type string: data = dd.read_csv(filename).astype(str), and then changed column types later on as necessary (using lambdas; see the second sketch after these comments). – Grant Shannon Nov 21 '18 at 15:33
  • @leoburgy The csv file is a dataset for a CNN model, so I don't want to cast the data to strings. – Hoàng Quốc Cường Nov 21 '18 at 15:41
  • ^ This doesn't tell us much. You said that the types were string or null, so explicitly loading as str sounds ok, but you have more information than we do. – mdurant Nov 21 '18 at 15:47
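
A minimal sketch of the `dtype` approach suggested in the comments, assuming a placeholder file name (`large_file.csv` is not from the question); forcing every column to string sidesteps the per-partition dtype inference that raises the mismatch error:

```python
import dask.dataframe as dd

# Placeholder file name; the real file has ~9600 columns and ~129,013 rows.
# Reading every column as string avoids Dask's per-partition dtype inference,
# which is what triggers the "Mismatched dtypes" error on head().
df = dd.read_csv("large_file.csv", dtype=str)

print(df.head())
```

Individual columns can also be listed in the `dtype` mapping (e.g. `dtype={"col_0": "float64"}`) if only a few of them are ambiguous.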
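
And a sketch of the load-then-convert approach from the comment above, again with hypothetical file and column names; it uses `astype` with a dtype mapping rather than lambdas to cast the needed columns back to numeric:

```python
import dask.dataframe as dd

# Hypothetical names: load everything, cast to string up front, then convert
# only the columns that are actually needed back to numeric types.
df = dd.read_csv("large_file.csv").astype(str)

# float64 tolerates missing values, unlike int64.
df = df.astype({"feature_1": "float64", "feature_2": "float64"})

print(df.dtypes.head())
```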

0 Answers