4

Under what circumstances would the Unix command-line utility 'wc' and Python's len(text.split()) give different results?

A bit of context, although it shouldn't be relevant, because the only thing we are doing here is counting words/tokens (i.e. sequences of characters separated by spaces). I am working with the German files of the IWSLT 2014 corpus and have already tokenized them with this script (i.e. punctuation marks should already be split off, etc.). For the test and validation sets, wc and Python give the same number of words (125754 and 140433 words, respectively). For the training set, they do NOT. With Python 3 I get the following result:

$ python3
>>> text = open('train.de','r').read()
>>> len(text.split())
3100720

While with the wc utility:

$ wc -w train.de 
3100699 train.de

Notice that the difference is very small, but still enough to be problematic: only 21 words of difference in a text of about 3.1 million words.

What could be happening? I have already checked the documentation of both, and the two should be equivalent.

Thanks in advance.

EDIT: additional information about my local environment: Ubuntu 16.04, with `locale` giving the following output:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=es_ES.UTF-8
LC_TIME=es_ES.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=es_ES.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=es_ES.UTF-8
LC_NAME=es_ES.UTF-8
LC_ADDRESS=es_ES.UTF-8
LC_TELEPHONE=es_ES.UTF-8
LC_MEASUREMENT=es_ES.UTF-8
LC_IDENTIFICATION=es_ES.UTF-8
LC_ALL=
nohamk
    The `wc` result is probably dependent on your locale environment variables. – Barmar Mar 07 '19 at 18:41
  • If you wanted to track down the specific lines for which results differ, this is a case where a bisection algorithm would be very handy. Split the file in half; at least one of those two halves will still have a delta; pick one of them, split it again, repeat until you have a single line. – Charles Duffy Mar 07 '19 at 18:43
  • I'm using Ubuntu 16.04 with default settings. Which additional information should I provide? Thanks. – nohamk Mar 07 '19 at 18:44
  • @Jose, the output of the `locale` command might be a place to start. Whether behavior differs with `LC_ALL=C wc -w train.de` (if your initial locale is something else) could also be pertinent. – Charles Duffy Mar 07 '19 at 18:44
  • Since it's German text, I suspect it has to do with whether `wc` recognizes multi-byte characters. – Barmar Mar 07 '19 at 18:44
  • *tries to remember if `open(..., 'r')` defaults binary or text mode in Python 3* – Charles Duffy Mar 07 '19 at 18:46
  • @Jose, ...so you're working with German text, but your system is mostly configured for Spanish? – Charles Duffy Mar 07 '19 at 18:47
  • @CharlesDuffy already added the output of the locale command. I'll add results with LC_ALL=C wc -w train.de asap. – nohamk Mar 07 '19 at 18:48
  • I wonder if you might need `LC_ALL=de_DE.UTF-8` or such. Would be easier if you went the bisection route to isolate some specific lines for which the two sources differ; providing that content would mean we could repro the issue & test proposed fixes ourselves. – Charles Duffy Mar 07 '19 at 18:50
  • @CharlesDuffy 'r' is text mode, byte mode is 'rb'. The configuration is mostly in English and Spanish, but with UTF-8. Also, according to https://stackoverflow.com/questions/48816403/regarding-the-unix-command-wc-what-is-considered-as-a-word 'A word is a non-zero-length sequence of characters delimited by white space.' How could that be influencing the result? I'll try to change to German and checking the results. Thanks. – nohamk Mar 07 '19 at 18:51
  • Could there be a difference in how it counts around a newline? I recommend doing bisections per the comment above and showing us the lines that are causing the difference. – ffejrekaburb Mar 07 '19 at 18:52
  • To give you a concrete example of how `LC_ALL` can change behavior here -- `LC_CTYPE` determines which characters are *considered* "white space" (or members of other character classes) within a given locale. – Charles Duffy Mar 07 '19 at 18:53
  • @ffejrekaburb perhaps, but then why, in 3.4 million words (adding the 3 files), does it only happen 20 times? How could I check it? – nohamk Mar 07 '19 at 18:54
  • start slicing the text body into chunks. Focus on the chunks that are showing differences and keep dividing those into further chunks until only a few lines are left. – ffejrekaburb Mar 07 '19 at 18:56
  • @CharlesDuffy Understood, very relevant therefore. – nohamk Mar 07 '19 at 18:56
  • As suggested by @CharlesDuffy and @ffejrekaburb, I'm going to try to slice the text. – nohamk Mar 07 '19 at 18:57
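
A minimal sketch of that line-by-line comparison (assuming a newline-delimited train.de and wc on the PATH; spawning one wc process per line is slow, but fine for a one-off check):

import subprocess

# Compare Python's split() count with `wc -w` for each line of train.de
# and report every line on which the two disagree.
with open('train.de', 'r', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        py_count = len(line.split())
        wc_out = subprocess.run(['wc', '-w'], input=line.encode('utf-8'),
                                stdout=subprocess.PIPE).stdout
        wc_count = int(wc_out.split()[0])
        if py_count != wc_count:
            print(lineno, py_count, wc_count, repr(line))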

1 Answer


Not sure if this was your case, but it may be useful for someone: on my system with Python 3.6, split() splits on the non-breaking space (\xa0), whereas wc -w does not.
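
A quick way to see the difference, as a minimal sketch (the sample string is made up, and the exact wc behaviour may also depend on your locale):

# Python 3 treats U+00A0 (non-breaking space) as whitespace, so split()
# breaks this string into two tokens.
s = 'Guten\xa0Tag'
print(len(s.split()))  # prints 2

# wc -w, by contrast, typically does not treat U+00A0 as a word separator,
# so the same text counts as a single word:
#   $ printf 'Guten\xc2\xa0Tag\n' | wc -w
#   1

If the training file contains a handful of non-breaking spaces while the test and validation files do not, this alone could explain a difference of a few dozen words in a corpus of millions.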

Hlib Babii