CountVectorizer uses the default token_pattern (?u)\b\w\w+\b. The regex metacharacter \w in Python's core regular expression engine (the re module) does not match ZWJ (U+200D) or ZWNJ (U+200C), so words containing these joiners are split apart.
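To see the problem, here is a minimal sketch of the default behaviour, with the ZWNJ written explicitly as \u200c so that it is visible in the source; the first word is split at the joiner:
from sklearn.feature_extraction.text import CountVectorizer
s = ["درخت\u200cهای زیبا"]  # the first word contains a ZWNJ
cv = CountVectorizer()      # default token_pattern (?u)\b\w\w+\b
cv.fit(s)
print(*cv.vocabulary_, sep="\n")
# درخت
# های
# زیبا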
There are two approaches that can be taken:
- Use a custom token_pattern; or
- Set token_pattern to None and define your own tokenizer.
Python's \w, used by scikit-learn, is not compatible with the Unicode definition of a word character. Where the definition matters, the second approach would be preferred.
1) Custom token_pattern
In this scenario, we specify a custom regex pattern that adds ZWJ and ZWNJ to the default pattern:
from sklearn.feature_extraction.text import CountVectorizer
s = ["درختهای زیبا"]
cv1 = CountVectorizer(
    token_pattern=r'(?u)\b\w+[\u200C\u200D]?\w+\b'
)
cv1.fit(s)
print(*cv1.vocabulary_, sep="\n")
# درختهای
# زیبا
The input string is divided into two words.
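Note that this pattern allows at most one joiner inside a token. If your text contains words with more than one ZWNJ or ZWJ, a slightly more general pattern along the same lines should also work (a sketch, not tested on a wider corpus):
cv1b = CountVectorizer(
    # two or more word characters, followed by any number of
    # joiner-separated runs of word characters
    token_pattern=r'(?u)\b\w\w+(?:[\u200C\u200D]\w+)*\b'
)
cv1b.fit(["می\u200cشده\u200cاند"])  # a verb form written with two explicit ZWNJs
print(len(cv1b.vocabulary_))
# 1  (the whole word is kept as a single token)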
2) Custom tokenizer
In this scenario, I will use an ICU4C break iterator, which allows language-specific boundary analysis. The break iterator returns the indices of the break boundaries, so it is necessary to process the results of the iteration to assemble the tokens.
N.B. token_pattern needs to be set to None to use tokenizer.
import icu
from sklearn.feature_extraction.text import CountVectorizer
import regex as re
bi = icu.BreakIterator.createWordInstance(icu.Locale('fa_IR'))
def tokenise(text, iterator=bi, strip_punct=True):
    iterator.setText(text)
    tokens = []
    start = iterator.first()
    for end in iterator:
        if strip_punct:
            # keep the segment only if it is not purely separators, digits or punctuation
            if not re.match(r'[\p{Z}\p{N}\p{P}]+', text[start:end]):
                tokens.append(text[start:end])
        else:
            tokens.append(text[start:end])
        start = end
    return tokens
s = ["درختهای زیبا"]
cv2 = CountVectorizer(
    tokenizer=tokenise,
    token_pattern=None
)
cv2.fit(s)
print(*cv2.vocabulary_, sep="\n")
# درختهای
# زیبا
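For reference, calling the tokeniser directly (with the ZWNJ written as an explicit \u200c) should show the two segments that end up in the vocabulary; the whitespace segment is filtered out by the strip_punct branch:
print(tokenise("درخت\u200cهای زیبا"))
# ['درخت\u200cهای', 'زیبا']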
2B) Custom tokenizer using regex
There is a variation of the custom tokeniser where we keep the default regular expression pattern for tokenisation but run it through an alternative regular expression engine. The default behaviour fails for Persian, and many other languages, because the definition of \w in core Python differs from the Unicode definition. If we use a more Unicode-compliant engine, such as the third-party regex module, the original pattern used by CountVectorizer will work with most languages, not just Persian.
from sklearn.feature_extraction.text import CountVectorizer
import regex as re
s = ["درختهای زیبا"]
def tokenise(text):
    return re.findall(r'(?u)\b\w\w+\b', text)
cv = CountVectorizer(
    tokenizer=tokenise,
    token_pattern=None
)
cv.fit(s)
print(*cv.vocabulary_, sep="\n")
# درختهای
# زیبا
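For contrast, running the same default pattern through both engines on a string with an explicit \u200c shows where they differ: core re splits at the joiner, while regex keeps the word intact.
import re as stdlib_re  # core Python engine
import regex            # third-party, more Unicode-compliant engine
text = "درخت\u200cهای زیبا"
pattern = r'(?u)\b\w\w+\b'  # CountVectorizer's default token_pattern
print(stdlib_re.findall(pattern, text))
# ['درخت', 'های', 'زیبا']
print(regex.findall(pattern, text))
# ['درخت\u200cهای', 'زیبا']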