How do I write code to determine whether the EOL character in a CSV file is `\r` or `\n` without looking at the file contents?

Question

I'm using Python in Jupyter Notebooks to work with a CSV file. I'm writing the same code in two different versions of Jupyter Notebook--one that's running directly on my computer and another that's running off a kind of emulator within an online lesson from Dataquest. When I open the CSV file and read it into a string on my computer's Jupyter Notebook, the EOL character is \r but when I do the same on Dataquest's emulator, the EOL character is \n. I have two questions:

Why does this happen?
How can I write Python code that tests for the EOL character, so I don't have to open the file and inspect it visually?

This code is in a Jupyter notebook on my own Mac.

f = open('US_births_1994-2003_CDC_NCHS.csv', 'r')
data_MyComp = f.read()
data_MyComp

This code is on Dataquest's Jupyter notebook browser emulator.

f = open('US_births_1994-2003_CDC_NCHS.csv', 'r')
data_dataquest = f.read()
data_dataquest

This is a few lines of output from my computer when I run data_MyComp (note the EOL character is \r).

'year,month,date_of_month,day_of_week,births\r1994,1,1,6,8096\r1994,1,2,7,7772\r1994,1,3,1,10142\r1994,1,4,2,11248\r1994,1,5,3,11053\r1994,1,6,4,11406\r1994,1,7,5,11251\r1994,1,8,6,8653\r1994,1,9,7,7910\r1994,1,10,1,10498\r1994,1,11,2,11706\r

This is a few lines of output from the Dataquest emulator when I run data_dataquest (note the EOL character is \n).

'year,month,date_of_month,day_of_week,births\n1994,1,1,6,8096\n1994,1,2,7,7772\n1994,1,3,1,10142\n1994,1,4,2,11248\n1994,1,5,3,11053\n1994,1,6,4,11406\n

https://docs.python.org/3/library/functions.html#open the `newline` flag handles that for you, or am I missing something? — gold_cy, Jan 02 '19 at 00:06
I suppose "opening the file" really means "manual inspection" here. In order to process the contents of a file you *have* to `open()` it. — tripleee, Jan 02 '19 at 00:12
Is your own computer by any chance running Windows? How exactly are you making the file available to Jupyter? — tripleee, Jan 02 '19 at 00:14
If you just want to read the CSV file, use the [`csv`](https://docs.python.org/3/library/csv.html) module from the standard library. It should properly handle the line endings on its own. — mkrieger1, Jan 02 '19 at 00:45
@tripleee Yes, I mean "manual inspection". Thanks for clarifying. — Data2Dollars, Jan 02 '19 at 05:20
@tripleee I edited the question a bit further. Hopefully, it's clearer. — Data2Dollars, Jan 02 '19 at 05:26
@tripleee I'm on a Mac running OSX not a Windows machine. The file is available here: https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv I saved it as a CSV to my machine and then opened it using the code shown above. — Data2Dollars, Jan 02 '19 at 05:27

tripleee · Answer 1 · 2019-01-02T05:44:47.470

Without any indication of how you downloaded or otherwise made the file available to Python and Jupyter, we can't really tell why this is happening. Line endings are platform-specific but Python 3 should generally neutralize differences between platforms unless you specifically request opening a file as "binary".
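As a rough illustration of that translation, here is a minimal sketch that reads the same file in text mode twice: once with the default settings and once with `newline=''`, which the `open()` documentation describes as returning line endings untranslated. It also reads the file object's `newlines` attribute, which should report the terminators seen so far. The helper name `read_three_ways` and the sample data are made up for this example; they are not from your file.

```python
import os
import tempfile

def read_three_ways(path):
    """Return (translated text, untranslated text, newlines seen) for one file."""
    # Default text mode: '\r' and '\r\n' are translated to '\n' on read.
    with open(path, 'r') as f:
        translated = f.read()
        seen = f.newlines  # terminators encountered while reading
    # newline='' keeps the original terminators in the returned string.
    with open(path, 'r', newline='') as f:
        untranslated = f.read()
    return translated, untranslated, seen

# Hypothetical sample: a tiny CSV with bare CR terminators.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'year,births\r1994,8096\r1995,7772')
    sample_path = tmp.name

translated, untranslated, seen = read_three_ways(sample_path)
print(repr(translated))
print(repr(untranslated))
print(repr(seen))
os.remove(sample_path)
```

If the two environments print different results for the same file, that points to a difference in the Python version or in how the file was opened, not in the file itself.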

You can discover the line-ending conventions by simply opening the file and reading enough of it. What's "enough" depends on the file type. Perhaps something like this in your case:

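# Open in binary mode so Python does not translate the line endings.
with open('US_births_1994-2003_CDC_NCHS.csv', 'rb') as peek:
    buf = peek.read(1024)
    # Search the bytes we read (buf), not the file object (peek).
    # Check CR/LF first, because it also contains a bare CR and a bare LF.
    if b'\r\n' in buf:
        print("DOS CR/LF line terminator")
    elif b'\r' in buf:
        print("Plain CR seen (legacy Mac or CP/M file)?")
    elif b'\n' in buf:
        print("Plain LF seen (standard Unix text file)")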

This doesn't attempt to do any statistical analysis, but might work well enough for your limited case. The file will be closed again after the end of the with block so you can then just open it a second time with the parameters you actually need.
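If you do want a crude tally instead of a first-match test, one possible sketch is to count each kind of terminator in the sample and pick the most common. The function names `count_line_endings` and `detect_line_ending` and the default `sample_size` are assumptions made for this example, not part of any library.

```python
def count_line_endings(buf):
    """Count CR/LF pairs, lone CRs, and lone LFs in a bytes buffer."""
    crlf = buf.count(b'\r\n')
    # Every CR/LF pair also contains one CR and one LF, so subtract the pairs.
    lone_cr = buf.count(b'\r') - crlf
    lone_lf = buf.count(b'\n') - crlf
    return {'\r\n': crlf, '\r': lone_cr, '\n': lone_lf}

def detect_line_ending(path, sample_size=4096):
    """Return the most common terminator in the first sample_size bytes, or None."""
    with open(path, 'rb') as peek:
        buf = peek.read(sample_size)
    # Caveat: a CR/LF pair split across the sample boundary is counted as a lone CR.
    counts = count_line_endings(buf)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Calling `detect_line_ending('US_births_1994-2003_CDC_NCHS.csv')` would then return `'\r'`, `'\n'`, `'\r\n'`, or `None` if the sample contains no terminator at all.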
