
I'm wondering why a simple line count in bash gives a different number of lines than the count computed in Python (version 3.6) for the files given here (train_en.txt) and here (train_de.txt). In bash, I'm using the commands:

wc -l train_en.txt
wc -l train_de.txt

The outputs are 4520620 and 4520620, respectively.

In python, I'm using the commands:

print(sum(1 for line in open('train_en.txt')))
print(sum(1 for line in open('train_de.txt')))

The outputs are 4521327 and 4521186, respectively.

When I use the python commands

len(open('train_en.txt').read().splitlines())
len(open('train_de.txt').read().splitlines())

I get 4521334 and 4521186, respectively (note that the train_en.txt count does not match that of the previous Python command).
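To illustrate that these counting methods are not equivalent in general, here is a small self-contained sketch (the sample string is made up, not taken from the corpora): text-mode iteration applies universal newlines, so a lone \r also ends a line; str.splitlines() splits on still more separators; and a raw byte count sees only \n.

import io

s = 'one\ntwo\rthree\u2028four'

# text-mode iteration: universal newlines (\n, \r, \r\n) -> 3 lines
print(sum(1 for line in io.StringIO(s, newline=None)))

# str.splitlines() also splits on \v, \f, \x85, \u2028, \u2029 -> 4 lines
print(len(s.splitlines()))

# raw \n bytes, which is what wc -l counts -> 1
print(s.encode('utf-8').count(b'\n'))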

For reference, these are parallel corpora of text produced by concatenating the Common Crawl, Europarl, and News Commentary datasets (in that order) from the WMT '14 English to German translation task and should have the same number of lines.

Vivek Subramanian
  • With which locale settings? Does running with `LC_ALL=C` exported modify behavior? – Charles Duffy Jun 26 '19 at 22:59
  • Red Hat Enterprise Linux Server, version 7.6, fedora – Vivek Subramanian Jun 26 '19 at 23:01
  • What's the output of the `locale` command? – Charles Duffy Jun 26 '19 at 23:02
  • This throws `UnicodeDecodeError` for me in Python 3.6 – JacobIRR Jun 26 '19 at 23:04
  • @CharlesDuffy, sorry, the output of `locale` is: LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL= – Vivek Subramanian Jun 26 '19 at 23:05
  • @JacobIRR, yes, upon exporting `LC_ALL=C`, I get the same error. – Vivek Subramanian Jun 26 '19 at 23:06
  • Have you looked at which of those input datasets the issue happens in? If only a subset of them are correctly-encoded UTF-8, it'd be helpful to treat the faulty one(s) individually rather than needing to transform a dataset with content in different character sets mixed together. (Granted, it's not yet confirmed that that's the issue at hand, but it's a pretty good guess). – Charles Duffy Jun 26 '19 at 23:13
  • @CharlesDuffy, looking into it now. Will get back to you shortly. – Vivek Subramanian Jun 26 '19 at 23:14
  • (Making it `open('train_de.txt', encoding='utf-8')` will make the Python code less sensitive to `LC_CTYPE`, but doesn't do much for the discrepancy). – Charles Duffy Jun 26 '19 at 23:17
  • BTW, 4520620 is the correct count of the number of LF characters in the file; Python is presumably realizing that some of those LF characters are part of multi-byte "wide" characters (should the file be parsed as UTF-8), whereas your current locally-installed copy of `wc` is presumably not multi-byte-aware. – Charles Duffy Jun 26 '19 at 23:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195598/discussion-between-vivek-subramanian-and-charles-duffy). – Vivek Subramanian Jun 26 '19 at 23:20
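Following up on @CharlesDuffy's observation about raw LF counts, here is a rough way to tally every byte sequence that Python may treat as a line boundary (assuming the file fits in memory; the byte strings below are just the UTF-8 encodings of the usual suspects):

data = open('train_en.txt', 'rb').read()

# wc -l counts only the first of these
for name, seq in [(r'\n', b'\n'), (r'\r', b'\r'), (r'\v', b'\x0b'),
                  (r'\f', b'\x0c'), (r'\x85', b'\xc2\x85'),
                  (r'\u2028', b'\xe2\x80\xa8'), (r'\u2029', b'\xe2\x80\xa9')]:
    print(name, data.count(seq))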

1 Answer


When the files are opened in text mode, Python recognizes more line endings than just \n: universal-newline handling treats a lone \r as a line break, and str.splitlines() additionally splits on characters such as \v, \f, \x85, \u2028, and \u2029. By contrast, wc -l counts only raw \n bytes. One can avoid the discrepancy by opening the files in binary mode. The commands

print(sum(1 for line in open('train_en.txt', mode='rb')))
print(sum(1 for line in open('train_de.txt', mode='rb')))
len(open('train_en.txt', mode='rb').read().splitlines())
len(open('train_de.txt', mode='rb').read().splitlines())

all result in 4520620 (matching the output of wc -l), which means that the English and German corpora are parallel as desired.
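Since .read() pulls each multi-gigabyte corpus into memory at once, a chunked variant of the raw-byte count may be preferable; count_newlines below is a hypothetical helper, not a library function:

def count_newlines(path, bufsize=1 << 20):
    # count raw b'\n' bytes chunk by chunk, mirroring wc -l
    count = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            count += chunk.count(b'\n')
    return count

print(count_newlines('train_en.txt'))  # 4520620
print(count_newlines('train_de.txt'))  # 4520620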

Thanks to @CharlesDuffy for the help.

Vivek Subramanian