I'm wondering why a simple line count in bash gives me a different number of lines than one computed in Python (version 3.6) for the files given here (train_en.txt) and here (train_de.txt). In bash, I'm using the commands:
wc -l train_en.txt
wc -l train_de.txt
The outputs are 4520620 and 4520620, respectively.
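For comparison, here is a minimal sketch of what I understand `wc -l` to be counting, namely the number of newline (`\n`) bytes in the file, regardless of encoding. The function name `count_newline_bytes` and the chunk size are my own choices for illustration.

```python
def count_newline_bytes(path, chunk_size=1 << 20):
    """Count b'\\n' bytes in a file, reading it in binary chunks.

    This mirrors what `wc -l` reports: the number of newline bytes,
    not the number of "lines" as any text-aware tool would see them.
    """
    total = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            total += chunk.count(b'\n')
    return total
```

Calling `count_newline_bytes('train_en.txt')` should, under that assumption, agree with the `wc -l` figure.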
In Python, I'm using the commands:
print(sum(1 for line in open('train_en.txt')))
print(sum(1 for line in open('train_de.txt')))
The outputs are 4521327 and 4521186, respectively.
When I instead use the Python commands
len(open('train_en.txt').read().splitlines())
len(open('train_de.txt').read().splitlines())
I get 4521334 and 4521186, respectively (so for train_en.txt the result doesn't match that of the previous Python command).
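To narrow this down, the three counting methods can be run side by side on one file. The sketch below is only a diagnostic under some assumptions: the helper name `compare_counts` is hypothetical, and it assumes the files are UTF-8 encoded (my commands above rely on the platform default encoding instead). It reflects documented Python behaviour: text mode with universal newlines treats `\r` and `\r\n` as line endings in addition to `\n`, and `str.splitlines()` also splits on further characters such as `\x0b`, `\x0c`, `\x1c` to `\x1e`, `\x85`, `\u2028` and `\u2029`.

```python
def compare_counts(path, encoding='utf-8'):
    """Return the line count of `path` as seen by three different methods.

    The encoding default is an assumption, not something known about the files.
    """
    # 1. Count raw b'\n' bytes, which is what `wc -l` reports.
    with open(path, 'rb') as f:
        newline_bytes = f.read().count(b'\n')

    # 2. Iterate in text mode: universal newlines split on \n, \r and \r\n.
    with open(path, encoding=encoding) as f:
        text_iter = sum(1 for _ in f)

    # 3. read() + splitlines(): splits on \n plus extra Unicode line boundaries.
    with open(path, encoding=encoding) as f:
        splitlines = len(f.read().splitlines())

    return {
        'newline_bytes': newline_bytes,
        'text_iter': text_iter,
        'splitlines': splitlines,
    }
```

If the three numbers differ for a file, the gap between `newline_bytes` and `text_iter` would point to stray `\r` characters, and the gap between `text_iter` and `splitlines` to the other line-boundary characters inside lines.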
For reference, these are parallel corpora of text produced by concatenating the Common Crawl, Europarl, and News Commentary datasets (in that order) from the WMT '14 English to German translation task and should have the same number of lines.