
I have a large CSV file with 9600 columns, and each column has a different type. When I read the file with a Dask DataFrame and call head(), I get the error "Mismatched dtypes found in pd.read_csv/pd.read_table". How can I ignore this error? Reading the file with pandas works without errors, but it is very slow because the file is 2.5 GB. Thanks!

  • That's a lot of columns! How many rows? – mdurant Nov 21 '18 at 14:50
  • Could you provide a sample of your csv file, along with the code you use to import it? – leoburgy Nov 21 '18 at 14:58
  • Depending on your degree of knowledge about the data types present in the data set, you may want to cast them to string by setting the `dtype` argument of the `read_csv()` function (see the first sketch after these comments). – leoburgy Nov 21 '18 at 15:03
  • In a footnote of the `Dask` docs (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv), you can read that despite the data-type inference, the presence of NaN can confuse the csv reader function. – leoburgy Nov 21 '18 at 15:06
  • @mdurant It has 129013 rows – Hoàng Quốc Cường Nov 21 '18 at 15:13
  • I had a similar issue and forced all columns initially to be of type string: data = dd.read_csv(filename).astype(str), and then changed column types later on as necessary (using lambdas; see the second sketch after these comments). – Grant Shannon Nov 21 '18 at 15:33
  • @leoburgy The csv file is a dataset for a CNN model, so I don't want to cast the data to strings. – Hoàng Quốc Cường Nov 21 '18 at 15:41
  • ^ This doesn't tell us much. You said that the types were string or null, so explicitly loading as str sounds ok, but you have more information than we do. – mdurant Nov 21 '18 at 15:47
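
A minimal sketch of the `dtype` approach suggested in the comments, assuming a placeholder file name (`large_file.csv` is not from the question); forcing every column to string sidesteps the per-partition dtype inference that raises the mismatch error:

```python
import dask.dataframe as dd

# Placeholder file name; the real file has ~9600 columns and ~129,013 rows.
# Reading every column as string avoids Dask's per-partition dtype inference,
# which is what triggers the "Mismatched dtypes" error on head().
df = dd.read_csv("large_file.csv", dtype=str)

print(df.head())
```

Individual columns can also be listed in the `dtype` mapping (e.g. `dtype={"col_0": "float64"}`) if only a few of them are ambiguous.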
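
And a sketch of the load-then-convert approach from the comment above, again with hypothetical file and column names; it uses `astype` with a dtype mapping rather than lambdas to cast the needed columns back to numeric:

```python
import dask.dataframe as dd

# Hypothetical names: load everything, cast to string up front, then convert
# only the columns that are actually needed back to numeric types.
df = dd.read_csv("large_file.csv").astype(str)

# float64 tolerates missing values, unlike int64.
df = df.astype({"feature_1": "float64", "feature_2": "float64"})

print(df.dtypes.head())
```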

0 Answers