I have a large CSV file with 9600 columns, and each column has a different type. When I read the file using a Dask DataFrame
and call head()
, I get the error "Mismatched dtypes found in pd.read_csv/pd.read_table"
. How can I ignore it? Reading the CSV with pandas produces no errors, but it is very slow because the file is 2.5 GB.
Thanks!
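For reference, this Dask error typically means that the dtypes inferred from the sampled head of the file disagree with values seen later in the file. A minimal sketch of the usual workarounds (the file name and column names below are placeholders, not taken from the question):

```python
import dask.dataframe as dd

# Either tell Dask the problematic dtypes up front, let it sample more bytes
# before inferring, or assume un-typed integer columns may contain NaN.
df = dd.read_csv(
    "data.csv",                                     # placeholder path
    dtype={"col_a": "float64", "col_b": "object"},  # hypothetical columns
    sample=1_000_000,      # bytes sampled for dtype inference (default is smaller)
    assume_missing=True,   # treat unspecified integer columns as floats
)
print(df.head())
```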

- That's a lot of columns! How many rows? – mdurant Nov 21 '18 at 14:50
- Could you provide a sample of your CSV file, along with the code used to import it? – leoburgy Nov 21 '18 at 14:58
- Depending on your degree of knowledge about the potential data types present in the data set, you may want to cast the data type to string by setting the `dtype` argument in the `read_csv()` function. – leoburgy Nov 21 '18 at 15:03
- In the footnote of the Dask docs (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv), you can read that despite the inference about the data types, the presence of NaN can confuse the CSV reader function. – leoburgy Nov 21 '18 at 15:06
- @mdurant It has 129013 rows – Hoàng Quốc Cường Nov 21 '18 at 15:13
- I had a similar issue and forced all columns initially to be of type string: `data = dd.read_csv(filename).astype(str)`, and then changed column types later on, as necessary (using lambdas); see the sketch after these comments. – Grant Shannon Nov 21 '18 at 15:33
- @leoburgy The CSV file is a dataset for a CNN model, so I don't want to cast the data types to string. – Hoàng Quốc Cường Nov 21 '18 at 15:41
- ^ This doesn't tell us much. You said that the types were string or null, so explicitly loading as str sounds OK, but you have more information than we do. – mdurant Nov 21 '18 at 15:47
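A sketch of the approach suggested in the comments, assuming a placeholder file name: read every column as a string so dtype inference cannot mismatch, then cast columns back to numeric types for the model as needed.

```python
import dask.dataframe as dd

# Read everything as strings so Dask's sample-based dtype inference
# cannot disagree with later partitions.
df = dd.read_csv("data.csv", dtype=str)  # placeholder path

# Cast back to numeric afterwards; adjust per column if some columns
# are genuinely non-numeric or contain empty strings.
df = df.astype("float64")

print(df.head())
```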