Using half-space in R package quanteda

Question

I am using the kwic function in the quanteda package in R to look up some phrases in Kurdish. In Kurdish, some compound words and phrases are separated by a half-space. When I use a phrase that includes a half-space, R treats it as a typo (the red dot) and does not let me run the command. Is there a way to fix this?

The half-space, or zero-width non-joiner, is used in some languages to avoid a ligature when normalizing a text. Its Unicode code point is U+200C ('\u200c'), and in some text editors it can be typed with SHIFT+SPACE.

kwic(cleantest, phrase("له‌لایه‌نی"), window = 1)

Here is the image of the error

Also, do you know of a Sorani Kurdish POS Tagger and a Stemmer?

Can you provide an example of some text (i.e. in `cleantest`)? Kurdish reads right to left, which might be part of the issue here. — Ken Benoit, Apr 23 '18 at 00:16

score 1 · Accepted Answer · answered Apr 23 '18 at 03:06

Interesting problem. We have been thinking about this here and here recently.

Apparently the problem arises in the phrase conversion to a list, which relies on whitespace splitting. Here is a workaround to ensure that the half-spaces are converted into full spaces:

txt <- "رۆژنامه‌كانى به‌ریتانیا، ئاماژه‌ بۆ ئه‌وه‌ ده‌كه‌ن كه‌ سه‌ره‌ڕای ئه‌وه‌ی ڤینگه‌ر ده‌زانێت له‌ وه‌رزی داهاتوودا گه‌وره‌ترین كێشه‌ی له‌لایه‌نی گۆڵپارێزی ده‌بێت، به‌ڵام له‌گه‌ڵ ئه‌وه‌شدا ئاماده‌ نییه‌ به‌هیچ .شێوه‌یه‌ك پیته‌ر چیك له‌سه‌ر كورسی یه‌ده‌گ دابنێت "

phrase2 <- function(x) phrase(gsub("\\s", " ", x))

kwic(txt, phrase2("له‌لایه‌نی"), window = 1)

# [text1, 33:35] ی | له لایه نی | گۆڵپارێزی
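If `\\s` does not catch the half-space on your system, a more explicit variant targets the character directly. Unicode classes U+200C as a format character rather than whitespace, so whether `\\s` matches it may depend on the regex engine and platform. The following is only a minimal sketch in base R; the helper name `zwnj_to_space` and the Latin toy input are made up for illustration and are not part of quanteda.

```r
# U+200C ZERO WIDTH NON-JOINER, written as an escape so that no invisible
# character has to be typed into the editor.
zwnj <- "\u200c"

# Hypothetical helper (not part of quanteda): replace every ZWNJ with an
# ordinary space, using a literal (fixed) match instead of a regex class.
zwnj_to_space <- function(x) gsub(zwnj, " ", x, fixed = TRUE)

# Toy input: three Latin fragments joined by ZWNJ.
x <- paste("le", "laye", "ni", sep = zwnj)

utf8ToInt(zwnj)    # 8204, i.e. 0x200C
zwnj_to_space(x)   # "le laye ni"
```

Assuming it behaves the same on your text, the helper could stand in for the `gsub()` call inside `phrase2()`, and writing the search phrase with `\u200c` escapes may also sidestep the editor warning mentioned in the question.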

And no, I do not know of a Sorani Kurdish POS tagger or stemmer, although the stopwords package does include Kurdish stopwords.

stopwords("ku", source = "stopwords-iso")
#  [1] "ئێمە"     "ئێوە"     "ئەم"      "ئەو"      "ئەوان"    "ئەوەی"   
#  [7] "بۆ"       "بێ"       "بێجگە"    "بە"       "بەبێ"     "بەدەم"   
# [13] "بەردەم"   "بەرلە"    "بەرەوی"   "بەرەوە"   "بەلای"    "بەپێی"   
# [19] "تۆ"       "تێ"       "جگە"      "دوای"     "دوو"      "دە"      
# [25] "دەکات"    "دەگەڵ"    "سەر"      "لێ"       "لە"       "لەبابەت" 
# [31] "لەباتی"   "لەبارەی"  "لەبرێتی"  "لەبن"     "لەبەر"    "لەبەینی" 
# [37] "لەدەم"    "لەرێ"     "لەرێگا"   "لەرەوی"   "لەسەر"    "لەلایەن" 
# [43] "لەناو"    "لەنێو"    "لەو"      "لەپێناوی" "لەژێر"    "لەگەڵ"   
# [49] "من"       "ناو"      "نێوان"    "هەر"      "هەروەها"  "و"       
# [55] "وەک"      "پاش"      "پێ"       "پێش"      "چەند"     "کرد"     
# [61] "کە"       "ی"

Smart solution. It worked. Thank you so much for both answers. — Ali, Apr 23 '18 at 06:35
