CountVectorizer uses the default token_pattern (?u)\b\w\w+\b. The regex metacharacter \w in Python's core regular expression engine (the re module) does not match ZWJ (U+200D) or ZWNJ (U+200C), so words containing these joiners are split apart.
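To see the problem, here is a minimal sketch of the default behaviour, with the ZWNJ written explicitly as \u200c so that it is visible in the source; the first word is split at the joiner:
from sklearn.feature_extraction.text import CountVectorizer
s = ["درخت\u200cهای زیبا"]  # the first word contains a ZWNJ
cv = CountVectorizer()      # default token_pattern (?u)\b\w\w+\b
cv.fit(s)
print(*cv.vocabulary_, sep="\n")
# درخت
# های
# زیبا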
There are two approaches that can be taken:
- Use a custom token_pattern; or
- Set token_pattern to None and define your own tokenizer.
Python's \w, used by scikit-learn, is not compatible with the Unicode definition of a word character. Where the definition matters, the second approach would be preferred.
1) Custom token_pattern
In this scenario, we specify a custom regex pattern that adds ZWJ and ZWNJ to the default pattern:
from sklearn.feature_extraction.text import CountVectorizer
s = ["درختهای زیبا"]
cv1 = CountVectorizer(
    token_pattern=r'(?u)\b\w+[\u200C\u200D]?\w+\b'
)
cv1.fit(s)
print(*cv1.vocabulary_, sep="\n")
# درختهای
# زیبا
The input string is divided into two words.
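Note that this pattern allows at most one joiner inside a token. If your text contains words with more than one ZWNJ or ZWJ, a slightly more general pattern along the same lines should also work (a sketch, not tested on a wider corpus):
cv1b = CountVectorizer(
    # two or more word characters, followed by any number of
    # joiner-separated runs of word characters
    token_pattern=r'(?u)\b\w\w+(?:[\u200C\u200D]\w+)*\b'
)
cv1b.fit(["می\u200cشده\u200cاند"])  # a verb form written with two explicit ZWNJs
print(len(cv1b.vocabulary_))
# 1  (the whole word is kept as a single token)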
2) Custom tokenizer
In this scenario, I will use an ICU4C break iterator, which allows language-specific boundary analysis. The break iterator returns the indices of the break boundaries, so it is necessary to process the results of the iteration to assemble the tokens.
N.B. token_pattern needs to be set to None to use tokenizer.
import icu
from sklearn.feature_extraction.text import CountVectorizer
import regex as re
bi = icu.BreakIterator.createWordInstance(icu.Locale('fa_IR'))
def tokenise(text, iterator=bi, strip_punct=True):
    iterator.setText(text)
    tokens = []
    start = iterator.first()
    for end in iterator:
        if strip_punct:
            # keep the segment only if it is not purely separators, digits or punctuation
            if not re.match(r'[\p{Z}\p{N}\p{P}]+', text[start:end]):
                tokens.append(text[start:end])
        else:
            tokens.append(text[start:end])
        start = end
    return tokens
s = ["درختهای زیبا"]
cv2 = CountVectorizer(
    tokenizer=tokenise,
    token_pattern=None
)
cv2.fit(s)
print(*cv2.vocabulary_, sep="\n")
# درختهای
# زیبا
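For reference, calling the tokeniser directly (with the ZWNJ written as an explicit \u200c) should show the two segments that end up in the vocabulary; the whitespace segment is filtered out by the strip_punct branch:
print(tokenise("درخت\u200cهای زیبا"))
# ['درخت\u200cهای', 'زیبا']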
2B) Custom tokenizer using regex
There is a variation of the custom tokeniser where we keep the default regular expression pattern for tokenisation but run it through an alternative regular expression engine. The default behaviour fails for Persian, and many other languages, because the definition of \w in core Python differs from the Unicode definition. If we use a more Unicode-compliant engine, such as the third-party regex module, the original pattern used by CountVectorizer will work with most languages, not just Persian.
from sklearn.feature_extraction.text import CountVectorizer
import regex as re
s = ["درختهای زیبا"]
def tokenise(text):
    return re.findall(r'(?u)\b\w\w+\b', text)
cv = CountVectorizer(
    tokenizer=tokenise,
    token_pattern=None
)
cv.fit(s)
print(*cv.vocabulary_, sep="\n")
# درختهای
# زیبا
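For contrast, running the same default pattern through both engines on a string with an explicit \u200c shows where they differ: core re splits at the joiner, while regex keeps the word intact.
import re as stdlib_re  # core Python engine
import regex            # third-party, more Unicode-compliant engine
text = "درخت\u200cهای زیبا"
pattern = r'(?u)\b\w\w+\b'  # CountVectorizer's default token_pattern
print(stdlib_re.findall(pattern, text))
# ['درخت', 'های', 'زیبا']
print(regex.findall(pattern, text))
# ['درخت\u200cهای', 'زیبا']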