Under Windows, the standard EOL (end-of-line) terminator is a carriage return followed by a line feed (\r\n). When I use the to_csv method on a dataframe, that is what I get. However, when I use to_csv to write a gzip-compressed file, each line ends with two carriage returns followed by a line feed (\r\r\n).
import pandas as pd, sys, gzip, zlib
print("python:", sys.version)
print("pandas:", pd.__version__)
print("zlib :", zlib.ZLIB_RUNTIME_VERSION)
df = pd.DataFrame(data={'c0': ['a', 'b'], 'c1': ['c', 'd']})
print(df)
# Under Windows the EOL marker is \r\n, so this works as expected
df.to_csv('df.csv', index=None)
with open('df.csv', 'rb') as f:
    print("df.csv, default terminator :", f.read())
# with gzip it writes \r\r\n as EOL, looks like a bug
df.to_csv('df.csv.gz', index=None)
with gzip.open('df.csv.gz', 'rb') as f:
    print("df.csv.gz, default terminator:", f.read())
# when specifying only a single '\n' that's what is written
df.to_csv('df.csv', index=None, line_terminator='\n')
with open('df.csv', 'rb') as f:
    print("df.csv, '\\n' terminator :", f.read())
# with gzip, specifying a single '\n' writes \r\n as EOL, which is the desired result
df.to_csv('df.csv.gz', index=None, line_terminator='\n')
with gzip.open('df.csv.gz', 'rb') as f:
    print("df.csv.gz, '\\n' terminator :", f.read())
Here is the output:
python: 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 18:50:55) [MSC v.1915 64 bit (AMD64)]
pandas: 0.24.0
zlib : 1.2.11
  c0 c1
0  a  c
1  b  d
df.csv, default terminator : b'c0,c1\r\na,c\r\nb,d\r\n'
df.csv.gz, default terminator: b'c0,c1\r\r\na,c\r\r\nb,d\r\r\n'
df.csv, '\n' terminator : b'c0,c1\na,c\nb,d\n'
df.csv.gz, '\n' terminator : b'c0,c1\r\na,c\r\nb,d\r\n'
This clearly relates to the previously discussed question "CSV in Python adding an extra carriage return, on Windows". My issue is that the behavior differs between compressed and uncompressed files. Is this a known issue?
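
For reference, here is a minimal sketch of how a doubled carriage return can arise with a gzip text-mode handle, using only the standard library. It assumes (I have not confirmed this from the pandas source) that the compressed path writes text that already ends in \r\n through a handle that still applies newline translation. The helper name gzip_text_roundtrip is made up for illustration, and newline="\r\n" is used to mimic the Windows default so the snippet behaves the same on any platform.

```python
import gzip
import io


def gzip_text_roundtrip(text, newline):
    """Write text through a gzip text-mode handle; return the decompressed bytes.

    Hypothetical helper for illustration only, not part of pandas.
    """
    buf = io.BytesIO()
    # newline="\r\n" mimics the Windows default, where every "\n" written
    # is translated to "\r\n"; newline="" disables translation entirely.
    with gzip.open(buf, "wt", encoding="utf-8", newline=newline) as f:
        f.write(text)
    return gzip.decompress(buf.getvalue())


# CSV text that is already terminated with \r\n
csv_text = "c0,c1\r\na,c\r\nb,d\r\n"

translated = gzip_text_roundtrip(csv_text, newline="\r\n")
untranslated = gzip_text_roundtrip(csv_text, newline="")

print("with translation   :", translated)
print("without translation:", untranslated)
```

With translation enabled, each \n in the already-terminated text gains another \r, which matches the \r\r\n I see in df.csv.gz. Opening the handle with newline='' leaves the terminator untouched, which is the same fix suggested for the csv module in the linked question.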