
I build a DataFrame in pandas (v0.21.1, Python 3, Windows) with ~220k rows and write it out to CSV. When I open the file in Excel it looks fine (220k rows). When I read it back in with pandas, the file has an additional 40k rows and often various encoding errors.

I have tried multiple `to_csv` / `read_csv` `encoding=` combinations, including utf-8, utf-8-sig, cp1252, ascii and utf-16.

Write out:

- `encoding='cp1252'` or `'ascii'` - `UnicodeEncodeError: 'charmap' codec can't encode character '\u1e28' in position 261: character maps to <undefined>`
- `encoding='utf-8'`, `'utf-8-sig'`, `'utf-16'`, `'cp1252'` - no Python error in the console, but the file still doesn't render correctly when I import it again.
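Roughly what the write-out side looks like (a simplified sketch; `df` stands in for the real 220k-row DataFrame and the file names are placeholders):

```python
import pandas as pd

# placeholder for the real ~220k-row DataFrame
df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['plain', 'has, a comma', 'Ḩ non-ascii char']})

# these raise UnicodeEncodeError for characters like '\u1e28':
# df.to_csv('output.csv', index=False, encoding='cp1252')
# df.to_csv('output.csv', index=False, encoding='ascii')

# these write without a Python error, but the re-imported file is still wrong
df.to_csv('output.csv', index=False, encoding='utf-8')
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
```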

When reading the file back in I often get the warning: `DtypeWarning: Columns (0,1,3,4,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,37,38,39,40,41,42,43,46,47,48,49,50,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,91,92,93,94,95,96,97,98,99,100,101,102) have mixed types. Specify dtype option on import or set low_memory=False.`
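The read that produces this looks roughly like the following (continuing the sketch above; the path is a placeholder). As far as I understand, `low_memory=False` only changes how dtypes are inferred, so it silences the warning but should not change the number of rows parsed:

```python
import pandas as pd

# current read-in; this is what emits the DtypeWarning about mixed types
df_in = pd.read_csv('output.csv', encoding='utf-8-sig')

# reading the whole file in one go silences the warning (uses more memory),
# but only affects dtype inference, not row parsing
df_in = pd.read_csv('output.csv', encoding='utf-8-sig', low_memory=False)
```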

I have tried specifying the dtypes for the columns by saving the DataFrame's dtypes as a dict at `to_csv` time and passing the same dict to `read_csv`'s `dtype` parameter - but that also errors because unexpected data types are found, e.g. `ValueError: Integer column has NA values in column 33`.
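That attempt looks roughly like this (continuing the sketch above). As I understand it, the error comes from pandas not being able to store NaN in a plain integer column, so a column that was int64 on write but has blank cells on read trips the ValueError:

```python
import pandas as pd

# save the dtypes alongside the CSV ...
dtypes = df.dtypes.astype(str).to_dict()
df.to_csv('output.csv', index=False, encoding='utf-8-sig')

# ... and feed them back in on read; this raises
# "ValueError: Integer column has NA values" when an int column has blanks
df_in = pd.read_csv('output.csv', encoding='utf-8-sig', dtype=dtypes)
```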

When I write the file out and read it back in as an Excel file instead, it seems to work fine. When I try with a Python 2.7 installation, the same issue occurs.
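The Excel round trip that does work is just the following (sketch; assumes an Excel engine such as openpyxl/xlrd is installed):

```python
# Excel round trip preserves the row count for me
df.to_excel('output.xlsx', index=False)
df_xl = pd.read_excel('output.xlsx')
```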

I suspect the issue may lie with a third-party CSV file that I import, which only seems to read successfully when I use `'cp1252'`. I tried re-saving this input file in Excel as UTF-8, but that hasn't helped either.
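That file is read roughly like this (sketch; the file name is a placeholder). Re-exporting it as UTF-8 from pandas, rather than re-saving it in Excel, is something I could also try:

```python
import pandas as pd

# the 3rd-party input only reads cleanly with cp1252
third_party = pd.read_csv('third_party.csv', encoding='cp1252')

# untried alternative to re-saving in Excel: write it back out as UTF-8
# from pandas before using it alongside the main DataFrame
third_party.to_csv('third_party_utf8.csv', index=False, encoding='utf-8')
```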

Thanks for your suggestions!

dreab
  • Which separator are you using? It may cause additional rows in the CSV if it is not handled properly. – Dev Dec 20 '17 at 09:16
  • @Ryu I have only tried with ',' – dreab Dec 20 '17 at 10:55
  • Does the CSV file contain any commas other than the separator? If yes, I suggest using another separator when reading the CSV. – Dev Dec 20 '17 at 11:31

1 Answer


The `DtypeWarning` you are getting is raised because pandas could not deduce the data type of all those columns. Setting them to `str` in the `dtype` parameter will silence the warning.

Refer: https://stackoverflow.com/a/27232309/5182482
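A minimal sketch (the path and `some_numeric_column` are placeholders): read everything as `str` and convert the columns you actually need afterwards:

```python
import pandas as pd

# read every column as plain strings so pandas does not have to guess types
df = pd.read_csv('output.csv', encoding='utf-8-sig', dtype=str)

# convert individual columns back where needed
df['some_numeric_column'] = pd.to_numeric(df['some_numeric_column'],
                                          errors='coerce')
```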

> Read in using pandas and now the file has an additional 40k rows and often has various encoding errors.

I cannot tell you exactly what is causing that issue.

User124