
The situation is as follows.

Given the following piece of code:

import re

content = ''
count = len(re.split(r'\W+', content, flags=re.UNICODE))

print(count)

# Output is expected to be 0, as it has no words
# Instead output is 1

What is going wrong? All other word counts are correct.

EDIT: It also happens with content = '..' or content = '.!', so this is NOT a problem with Python's str.split() method but with regular-expression splitting in re.

IMPORTANT NOTE: Although the solution I gave works in my particular case, the correct general solution has not been found yet, because this is a regex issue that isn't 100% SOLVED!

CMPSoares
  • Not a duplicate, because re is an independent library and is expected to return only words and filter out the occurrence mentioned in the other post. – CMPSoares Mar 13 '15 at 01:45
  • How would it return only word characters when the input string is empty? – Avinash Raj Mar 13 '15 at 01:47
  • possible duplicate http://stackoverflow.com/questions/28970724/python-split-empty-string – Avinash Raj Mar 13 '15 at 01:48
  • No @AvinashRaj, it returns an array with an empty string when the input is an empty string. In the link you mentioned, this works correctly in the split function and doesn't when the string is `"\n"`. This has to do with the `re.split()` function. – CMPSoares Mar 13 '15 at 01:54
  • It has to do with this: https://docs.python.org/2/library/re.html#re.split – CMPSoares Mar 13 '15 at 01:56
  • It also happens when the string `'..'` is used... – CMPSoares Mar 13 '15 at 02:00
  • What happens with this `..` input? It splits on one or more non-word characters and finally returns two empty strings. It works correctly. – Avinash Raj Mar 13 '15 at 02:03
  • No @AvinashRaj, It does not work correctly in the scope of the problem. I want to count the existing words. Not "empty" words... – CMPSoares Mar 13 '15 at 02:11

1 Answer


Well, I found out what the reason is:

When re.split() is used, it splits a string on the given regular expression and returns a list of strings. If the string is empty, and thus there is nothing to split, it returns a list with a single empty string in it (['']). So when len() is applied, it counts a list with 1 element. The same mechanism explains '..': the pattern matches the whole string, leaving an empty string on each side of the match (['', '']).
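To make this behaviour concrete, here is a small sketch of what re.split() returns for the inputs discussed in the question. The helper name `split_words` is made up for this illustration and is not part of the original code.

```python
import re

def split_words(content):
    # Hypothetical helper: split on runs of non-word characters,
    # using the same pattern and flag as the question.
    return re.split(r'\W+', content, flags=re.UNICODE)

print(split_words(''))              # ['']
print(split_words('..'))            # ['', '']
print(split_words('hello world'))   # ['hello', 'world']
print(split_words('hello world.'))  # ['hello', 'world', '']
```

The last case suggests that a separator at the very end of the string also leaves an empty string behind, so len() would over-count there as well.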

A solution to this is the following piece of code:

import re

content = ''
count = 0 if content == '' else len(re.split(r'\W+', content, flags=re.UNICODE))

print(count)

# Output is 0, as expected: the conditional expression checks
# whether the string is empty. When it is empty it returns 0;
# otherwise it returns the word count.
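Since the note in the question says that inputs like '..' are still not handled, one possible way around it (a sketch only, not the approach taken in the code above) is to match the words directly with re.findall() instead of splitting on the separators. The function name `count_words` is hypothetical.

```python
import re

def count_words(content):
    # Collect every run of word characters; when nothing matches,
    # findall returns an empty list, so the count is 0.
    return len(re.findall(r'\w+', content, flags=re.UNICODE))

print(count_words(''))             # 0
print(count_words('..'))           # 0
print(count_words('.!'))           # 0
print(count_words('hello world'))  # 2
```

Because this counts matches rather than the pieces between separators, leading or trailing punctuation should not produce empty "words".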
CMPSoares