tokenizing a list doesn't work with UTF8

Question

I extract some data from an Oracle database to do some text mining. My data is UTF-8, and the vocabulary built from it does not handle it.

library(text2vec)
library(DBI)
Sys.setenv(TZ = "+03:00")

# Connect to Oracle (credentials and connection string are placeholders)
drv = dbDriver("Oracle")
con = dbConnect(drv, username = "user", password = "pass", dbname = "IP:port/servicename")

# Note: the name `list` shadows base::list(); kept here as in the original post
list = dbGetQuery(con, statement = "select * from test")

it_list = itoken(list$FNAME,
                 preprocessor = tolower,
                 tokenizer = word_tokenizer,
                 ids = list$ID,
                 progressbar = FALSE)

# Unigrams and bigrams
vocab = create_vocabulary(it_list, ngram = c(ngram_min = 1L, ngram_max = 2L))

But only English words end up in vocab.

The list object is available at this link (it can be loaded with load()).
I am on Windows.
R.version:

platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          3.0
year           2016
month          05
day            03
svn rev        70573
language       R
version.string Oracle Distribution of R version 3.3.0 (2016-05-03)
nickname       Supposedly Educational

score 1 · Accepted Answer · answered Sep 19 '17 at 07:43


Thanks for reporting. This is actually an issue with base::strsplit(), which is used for basic tokenization.

I suggest using the stringi package for regular expressions with strong UTF-8 support. Or simply use tokenizers, a good tokenization solution built on top of stringi.

For example, you can use tokenizers::tokenize_words as a drop-in replacement for word_tokenizer:

tokenizers::tokenize_words("پوشاک بانک لي ")
# "پوشاک" "بانک"  "لي"
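As a rough sketch of how that replacement could look in the pipeline from the question: the two-row data frame `docs` below is made up for illustration and stands in for the result of dbGetQuery(). Reading terms via `vocab$term` assumes a text2vec version in which create_vocabulary() returns a data frame with a `term` column.

```r
library(text2vec)
library(tokenizers)

# Hypothetical stand-in for the data frame returned by dbGetQuery()
docs = data.frame(ID = c(1L, 2L),
                  FNAME = c("پوشاک بانک لي", "test bank"),
                  stringsAsFactors = FALSE)

# Same itoken() call as in the question, with only the tokenizer swapped
it_docs = itoken(docs$FNAME,
                 preprocessor = tolower,
                 tokenizer = tokenizers::tokenize_words,
                 ids = docs$ID,
                 progressbar = FALSE)

# Unigrams and bigrams; bigram terms are joined with "_" by default
vocab = create_vocabulary(it_docs, ngram = c(ngram_min = 1L, ngram_max = 2L))
print(vocab)
```

Note that tokenize_words() lowercases its input by default, so the tolower preprocessor is redundant here; it is kept only to stay close to the original call.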

For some reason base::strsplit() doesn't consider these Arabic symbols to be "alphanumeric" ([[:alnum:]]):

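library(magrittr)  # provides the %>% pipe used below

# Split on non-word characters, then drop empty strings
strsplit("i was. there", "\\W") %>% lapply(function(x) x[nchar(x) > 0])
# "i"     "was"   "there"
strsplit("پوشاک بانک لي ", "\\W") %>% lapply(function(x) x[nchar(x) > 0])
# character(0)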
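For comparison, here is a minimal sketch of the same split done with stringi, whose ICU-based regex engine treats \W in a Unicode-aware way. The wrapper name `stringi_tokenizer` is hypothetical and only shows how such a function might be passed to itoken() as the `tokenizer` argument.

```r
library(stringi)

x = "پوشاک بانک لي "

# Split on non-word characters; omit_empty = TRUE drops zero-length pieces
tokens_regex = stri_split_regex(x, "\\W", omit_empty = TRUE)

# Alternative: extract words using ICU word-boundary rules
tokens_words = stri_extract_all_words(x)

# Hypothetical wrapper returning one character vector per input string.
# stri_extract_all_words() yields NA for strings with no words unless
# omit_no_match = TRUE is set, so it is set here.
stringi_tokenizer = function(strings) {
  stri_extract_all_words(strings, omit_no_match = TRUE)
}

print(tokens_regex)
print(tokens_words)
```

Both calls return a list with one character vector per input string, which is the shape itoken() expects from a tokenizer.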


Dmitriy Selivanov


Those are Persian symbols :) – parvij Sep 19 '17 at 09:20
But the alphabet is the same, isn't it? – Dmitriy Selivanov Sep 19 '17 at 09:44
You're right; the difference is like that between the German and English alphabets. – parvij Sep 19 '17 at 10:00
