20

I have a string column that sometimes has carriage returns in the string:

import pandas as pd
from io import StringIO

datastring = StringIO("""\
country  metric           2011   2012
USA      GDP              7      4
USA      Pop.             2      3
GB       GDP              8      7
""")
df = pd.read_table(datastring, sep='\s\s+')
df.metric = df.metric + '\r'  # append carriage return

print(df)
  country  metric  2011  2012
0     USA   GDP\r     7     4
1     USA  Pop.\r     2     3
2      GB   GDP\r     8     7

When writing to and reading from csv, the dataframe gets corrupted:

df.to_csv('data.csv', index=None)

print(pd.read_csv('data.csv'))
  country metric  2011  2012
0     USA    GDP   NaN   NaN
1     NaN      7     4   NaN
2     USA   Pop.   NaN   NaN
3     NaN      2     3   NaN
4      GB    GDP   NaN   NaN
5     NaN      8     7   NaN

Question

What's the best way to fix this? The one obvious method is to just clean the data first:

df.metric = df.metric.str.replace('\r', '')
petezurich
  • 9,280
  • 9
  • 43
  • 57
Kamil Sindi
  • 21,782
  • 19
  • 96
  • 120

4 Answers4

18

Specify the line_terminator:

print(pd.read_csv('data.csv', line_terminator='\n'))

  country  metric  2011  2012
0     USA   GDP\r     7     4
1     USA  Pop.\r     2     3
2      GB   GDP\r     8     7

UPDATE:

In more recent versions of pandas (the original answer is from 2015) the name of the argument changed to lineterminator.

Mike Müller
  • 82,630
  • 20
  • 166
  • 161
  • Typo: the argument is actually called `line_terminator` – smci Jan 09 '20 at 13:40
  • 1
    This is actually an argument for `to_csv`, not for `read_csv`. Trying to use this line gives an error. – kilgoretrout Mar 09 '20 at 13:24
  • pandas changes over time. See update in the answer. – Mike Müller Mar 09 '20 at 16:47
  • 2023: Useful with Databricks somehow ignoring the `\r\n` while in Jupyter it works. Something to highlight is that Jupyter mentions that `line_terminator` is deprecated but in Databricks it complains and needs this way with underscore in the name. – Gerlex Jan 25 '23 at 10:26
6

To anyone else who is dealing with such an issue:

@mike-müller's answer doesn't actually fix the issue, and the file is still corrupted when it is read by other CSV readers (e.g. Excel). You need to fix this once you write the file rather than while reading it.

The problem lies in not quoting strings having newline characters (\r, \n, or \r\n depending on the OS style). This will not keep the CSV reader (e.g. pandas, Excel, etc.) from parsing the newline characters and then it messes up the loaded CSV file into having multiple lines per unquoted records.

The generalized newline char in Python is \r\n as you strip string by these chars e.g. str.strip('\r\n'). This will make Python identify and cover all OS newline styles.

In pandas, reading CSV file by line_terminator='\r\n' wraps all strings having either \n or \r into double quotes to preserve quoting and keep readers from parsing newline chars later.

Just to provide the code:

pd.to_csv('data.csv', line_terminator='\r\n'))

Shayan Amani
  • 5,787
  • 1
  • 39
  • 40
  • https://stackoverflow.com/questions/454725/python-get-proper-line-ending suggests that '\n' is actually closer to something like a generalized newline character than "\r\n". However, I think there is nothing like a "generalized newline char in Python". "\r\n" is just the Windows line terminator, not a general one, and it seems to work for you everywhere, probably because you are mostly using certain OSes. And that it works in Excel is just because Excel is a Windows application. I think your answer is misleading. – Daniel S. Dec 25 '21 at 11:39
  • @DanielS. generalized for the strip() function. As the [Python documentation on strip()](https://docs.python.org/3.4/library/stdtypes.html?highlight=strip#str.strip) expresses "The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped". So any combination of characters will be trimmed from both sides of the given string. – Shayan Amani Dec 25 '21 at 21:12
2

I got three working solutions. All of these look equally robust to me.

This one (credit should go to @Shayan Amani) works good because now read_csv considers only \n as the line separator, and therefore '\r' is just a character. Note that the behavior of to_csv will change by the platform; On windows, lines are separated by '\r\n'. This won't change the result though, thanks to the skip_blank_lines=True option of read_csv.

df.to_csv("tmp/test.csv", index=False)
pd.read_csv("tmp/test.csv", lineterminator="\n")

This one solves the problem by forcing the quote for text columns.

import csv
df.to_csv("tmp/test.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
pd.read_csv("tmp/test.csv")

Another option is to explicitly specify the line separator when saving. With this, text with '\r' is now quoted.

df.to_csv("tmp/test.csv", index=False, line_terminator="\r\n")
pd.read_csv("tmp/test.csv")
Kota Mori
  • 6,510
  • 1
  • 21
  • 25
0

In my case, applying quoting=csv.QUOTE_ALL solved the issue.

import csv
pd.to_csv('some_data.csv', quoting=csv.QUOTE_ALL)
John Prawyn
  • 1,423
  • 3
  • 19
  • 28