
I'm using pandas to load CSV files created by Excel, do some analysis, and then save the results to CSV files. I've noticed that the pandas `to_csv` and `from_csv` methods don't seem capable of handling special characters such as `\r`, but they don't raise any errors either.

In [7]: import pandas as pd

In [8]: data = {
   ....:     'A': ['one', 'two', 'three'],
   ....:     'B': ['four', 'five', 'six'],
   ....: }

In [9]: df = pd.DataFrame(data)

In [10]: df
Out[10]: 
       A     B
0    one  four
1    two  five
2  three   six

In [11]: df.loc[1,'A'] = 't' + '\r' + 'o'

In [12]: df
Out[12]: 
       A     B
0    one  four
1   t\ro  five
2  three   six

In [13]: df.to_csv("my_df.csv")

In [14]: df2 = pd.DataFrame.from_csv("my_df.csv")

In [15]: df2
Out[15]: 
       A     B
0    one  four
1      t   NaN
o   five   NaN
2  three   six

Since I'm not specifying any encoding here, I assume it is using ASCII, but even when I specify `encoding='utf-8'` for both writing and reading I get the same result.

How do I write a robust CSV writing and reading script so that rows/columns are not corrupted and no other unexpected things happen? If the only solution is to check and 'clean' every string before writing to CSV, what is the easiest way to do that?

Bill
  • CSV is a rather wild format with a ton of dialects. RFC 4180 tries to define it. There, carriage return (`\r`) is not allowed inside a TEXTDATA block; it is part of the line (record) delimiter. – Klaus D. Aug 30 '16 at 02:25

2 Answers


Unless someone has a better suggestion, I'm dealing with this specific issue as follows: by pre-processing every CSV file before loading it with pandas. It seems to work on my current system, but I'm not convinced it's foolproof.

In [30]: f = open("my_df.csv")

In [31]: content = f.read().replace('\r',' ')

In [32]: with open("my_df2.csv", "w") as g:
   ....:     g.write(content)
   ....:     

In [33]: df2 = pd.DataFrame.from_csv("my_df2.csv")

In [34]: df2
Out[34]: 
       A     B
0    one  four
1    t o  five
2  three   six
Bill
  • You don't need to write it back to file after preprocessing. Use `StringIO(content)` as a file handle, e.g.: `df2 = pd.read_csv(StringIO(content), ...)`. – ptrj Aug 30 '16 at 17:01
  • Also: Python 3 seems to automatically replace `\r` with `\n` on reading, so your `f.read().replace('\r',' ')` wouldn't work. In that case, maybe read in binary mode and then convert/decode the bytes to strings. – ptrj Aug 30 '16 at 17:12
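Putting the two comments above together, here is a sketch that reads the file in binary mode (so Python's universal-newline translation can't convert the `\r` first), cleans it, and hands the result to `read_csv` via `StringIO`, with no temporary file. It recreates `my_df.csv` from the question so it is self-contained:

```python
import io
import pandas as pd

# Recreate the problematic file from the question.
df = pd.DataFrame({'A': ['one', 't\ro', 'three'],
                   'B': ['four', 'five', 'six']})
df.to_csv("my_df.csv")

# Read raw bytes so universal-newline translation cannot turn '\r'
# into '\n' before we get a chance to strip it, then decode and clean.
with open("my_df.csv", "rb") as f:
    content = f.read().decode("utf-8").replace("\r", " ")

# Parse the cleaned text directly -- no second file on disk.
df2 = pd.read_csv(io.StringIO(content), index_col=0)
print(df2)
```

This uses `pd.read_csv` rather than the (long since deprecated) `pd.DataFrame.from_csv`; `index_col=0` reproduces its default index handling.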

Preprocessing may be the best option. But if you're looking for something else, you may try the `lineterminator` argument of `read_csv`:

df = pd.read_csv("my_df.csv", index_col=0, lineterminator='\n')

(Works for me on Linux, but I can't guarantee it for other platforms.)

ptrj