I build a DataFrame in pandas (v0.21.1, Python 3, Windows) with ~220k rows and write it out to CSV. Opening the file in Excel, it looks fine (220k rows). Reading it back in with pandas, the file now has an additional ~40k rows and often various encoding errors.
I have tried multiple `to_csv` / `read_csv` `encoding=` combinations, including `utf-8`, `utf-8-sig`, `cp1252`, `ascii`, and `utf-16`.
Write out:

- `encoding='cp1252'` or `'ascii'`: `UnicodeEncodeError: 'charmap' codec can't encode character '\u1e28' in position 261: character maps to <undefined>`
- `encoding='utf-8'`, `'utf-8-sig'`, `'utf-16'`, `'cp1252'`: no Python error in the console, but the file still doesn't render correctly when I read it back in.
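For reference, the round trip I'm doing looks roughly like this (`df` is a toy stand-in for my real ~220k-row DataFrame, and `out.csv` is a placeholder filename):

```python
import pandas as pd

# Toy stand-in for my real DataFrame; column "b" includes '\u1e28' (Ḩ),
# the kind of character that breaks the cp1252/ascii write.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "\u1e28"]})

df.to_csv("out.csv", index=False, encoding="utf-8-sig")
df2 = pd.read_csv("out.csv", encoding="utf-8-sig")

# Row counts match here, but not on my real data.
print(len(df), len(df2))
```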
When reading in, I often get the warning:
DtypeWarning: Columns (0,1,3,4,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,37,38,39,40,41,42,43,46,47,48,49,50,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,91,92,93,94,95,96,97,98,99,100,101,102) have mixed types. Specify dtype option on import or set low_memory=False.
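For what it's worth, the two routes the warning itself suggests look like this (a small inline CSV stands in for my real file); this silences the warning but doesn't explain the extra rows:

```python
import io
import pandas as pd

# Inline stand-in for my real file: column "a" mixes numbers and text.
csv_text = "a,b\n1,x\n2,y\nthree,z\n"

# Option 1: read everything as strings, convert selectively afterwards.
# This sidesteps the chunked type inference that produces DtypeWarning.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Option 2: low_memory=False reads the whole file in one pass, so each
# column gets a single inferred dtype instead of per-chunk guesses.
df2 = pd.read_csv(io.StringIO(csv_text), low_memory=False)
```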
I have tried specifying the dtypes for the columns by saving the dtypes dict at `to_csv` time and passing the same dict to `read_csv` - but that also errored because unexpected datatypes were found, e.g. `ValueError: Integer column has NA values in column 33`.
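As I understand it, that error is pandas refusing to put a missing value into an int column, since `int64` has no NA representation; declaring the affected columns as float in the dtype dict avoids it (a sketch with hypothetical column names, assuming this applies to my data):

```python
import io
import pandas as pd

# The empty cell in "score" becomes NA on read.
csv_text = "id,score\n1,10\n2,\n3,30\n"

# dtype={"score": "int64"} here would raise
# "ValueError: Integer column has NA values", because int64 cannot
# hold NaN; float64 can.
df = pd.read_csv(io.StringIO(csv_text),
                 dtype={"id": "int64", "score": "float64"})

print(df.dtypes)
```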
When I do the round trip as an Excel file instead, it seems to work fine. With a Python 2.7 installation, the same issue occurs.
I suspect the issue may be with a 3rd-party CSV file that I import, which only seems to read correctly when I use `'cp1252'`. I tried resaving this input file in Excel as UTF-8, but that hasn't worked either.
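What I'm attempting with that 3rd-party file, roughly, is to normalise its encoding once so everything downstream only ever sees UTF-8 (filenames are placeholders; the first block just fabricates a cp1252 file to stand in for the vendor's):

```python
import pandas as pd

# Fabricate a small stand-in for the 3rd-party file, saved in cp1252.
pd.DataFrame({"name": ["caf\u00e9", "na\u00efve"]}).to_csv(
    "vendor_export.csv", index=False, encoding="cp1252"
)

# Read it with the encoding that works, then write it back out as
# UTF-8 so the rest of the pipeline deals with a single encoding.
third_party = pd.read_csv("vendor_export.csv", encoding="cp1252")
third_party.to_csv("vendor_export_utf8.csv", index=False, encoding="utf-8")
```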
Thanks for your suggestions!