In which circumstances would the Unix command line utility 'wc' and Python's len(text.split()) give a different result?
A bit of context, though it shouldn't be relevant, since the only thing we are doing here is counting words/tokens (i.e. sequences of characters separated by whitespace). I am working with the German files of the IWSLT 2014 corpus, which I have already tokenized with this script (i.e. punctuation marks should already be separated, etc.). For the test and validation sets, wc and Python agree on the number of words (125754 and 140433 words, respectively). For the training set, they do NOT. With Python 3 I get the following results:
$ python3
>>> text = open('train.de', 'r').read()
>>> len(text.split())
3100720
While with the wc utility:
$ wc -w train.de
3100699 train.de
Notice that the difference is very subtle, but still problematic: only a 21-word difference in a text of about 3.1 million words.
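One thing I plan to try, in case the gap comes from whitespace handling (just a guess on my part): recount in Python using only ASCII whitespace, which I believe is roughly what wc -w uses under the C locale, and compare that with Python's default split(), which treats any Unicode whitespace character as a separator.

```python
import re

def count_both(text):
    # Default split(): any Unicode whitespace (NBSP, thin space, ...)
    # acts as a separator.
    unicode_count = len(text.split())
    # ASCII whitespace only -- my assumption of what wc -w counts.
    ascii_count = len([t for t in re.split(r'[ \t\n\r\f\v]+', text) if t])
    return unicode_count, ascii_count

# e.g. count_both(open('train.de', 'r').read())
```

If those two numbers disagree in the same way as wc vs. Python, that would point at some non-ASCII whitespace character in the training file.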
What could be happening? I have already checked the documentation for both, and as far as I can tell the two should be equivalent.
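In case it helps with the diagnosis, here is a quick scan I could run for characters that Python considers whitespace but that are not plain ASCII whitespace (the filename and the whole approach are only my guess at the cause):

```python
from collections import Counter

def odd_whitespace(text):
    # Characters str.split() treats as separators even though they are
    # not ASCII whitespace, e.g. U+00A0 NO-BREAK SPACE.
    return Counter(ch for ch in text
                   if ch.isspace() and ch not in ' \t\n\r\f\v')

# e.g. odd_whitespace(open('train.de', 'r').read())
```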
Thanks in advance.
EDIT: additional information about my local environment. I am on Ubuntu 16.04, and locale gives the following output:
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=es_ES.UTF-8
LC_TIME=es_ES.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=es_ES.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=es_ES.UTF-8
LC_NAME=es_ES.UTF-8
LC_ADDRESS=es_ES.UTF-8
LC_TELEPHONE=es_ES.UTF-8
LC_MEASUREMENT=es_ES.UTF-8
LC_IDENTIFICATION=es_ES.UTF-8
LC_ALL=