
I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but it mangles the special characters with bigrams.

The code works fine on Linux. I added some info on the locale.

library(tidytext)
library(dplyr)

df <- data_frame(
  text = "César Moreira Nuñez"
)

# works ok:
df %>% 
  unnest_tokens(word, text)


# # A tibble: 3 x 1
# word
# <chr>
# 1 césar
# 2 moreira
# 3 nuñez

# breaks é and ñ
df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2 )

# # A tibble: 2 x 1
# bigram
# <chr>
# 1 cã©sar moreira
# 2 moreira nuã±ez

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
rlabuonora
  • Can you post the output of `Sys.getlocale()` as well? Will help with debugging. – BrodieG Dec 08 '17 at 14:42
  • I can't reproduce this, though I strongly suspect it's a [Unicode normalization](https://en.wikipedia.org/wiki/Unicode_equivalence) issue. stringi has conversion functions; see `?stringi::stri_trans_nfc`. – alistaire Dec 08 '17 at 15:31
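  • Following up on the normalization idea in the comment above, here is a minimal sketch of what a Unicode-equivalence problem looks like and how `stringi` normalization resolves it (the sample strings are illustrative, not taken from the question; NFD writes é as `e` plus a combining accent, NFC composes it back into one code point):

    library(stringi)

    # Two byte-level representations of the same visible string:
    composed   <- "C\u00e9sar"    # NFC: é as a single code point
    decomposed <- "Ce\u0301sar"   # NFD: e + combining acute accent

    # They print identically but compare unequal byte-for-byte...
    composed == decomposed                   # FALSE

    # ...until both are normalized to the same form (NFC here):
    stri_trans_nfc(decomposed) == composed   # TRUE

  If this is the culprit, normalizing the text column with `stri_trans_nfc()` before tokenizing should make the forms consistent.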

4 Answers


It seems that this happens when you change the token argument to "ngrams". I'm not sure why it does that, but here is a workaround using the qlcMatrix package:

library(qlcMatrix)

splitStrings(df$text, sep = ' ', bigrams = TRUE, boundary = FALSE, bigram.binder = ' ')$bigrams
#[1] "César Moreira" "Moreira Nuñez"
Sotos

We have chatted with several people who have run into encoding issues like this before, with Polish and Estonian text. It's always a bit tricky because I can never reproduce the problem locally, and I cannot reproduce yours either:

library(tidytext)
library(dplyr)

df <- data_frame(
  text = "César Moreira Nuñez"
)

df %>% 
  unnest_tokens(word, text)
#> # A tibble: 3 x 1
#>   word   
#>   <chr>  
#> 1 césar  
#> 2 moreira
#> 3 nuñez

df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2 )
#> # A tibble: 2 x 1
#>   bigram       
#>   <chr>        
#> 1 césar moreira
#> 2 moreira nuñez

You say that your code works fine on Linux, and this aligns with others' experience as well. This seems to always be a Windows encoding issue. This isn't related to the code in the tidytext package, or even the tokenizers package; from what I've seen, I suspect this is related to the C libraries in stringi and how they act on Windows compared to other platforms. Because of this, you'll likely have the same problems with anything that depends on stringi (which is practically ALL of NLP in R).
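One workaround that has helped in similar Windows encoding situations (I can't verify it here, since I can't reproduce the bug on this machine) is to explicitly convert the text column to declared UTF-8 before tokenizing, so that stringi's C libraries see correctly-marked input rather than the native CP-1252:

    library(dplyr)
    library(tidytext)

    df <- data_frame(text = "César Moreira Nuñez")

    # Convert from the native encoding (often CP-1252 on Windows) to
    # declared UTF-8 before the ngram tokenizer touches the text:
    df %>%
      mutate(text = enc2utf8(text)) %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2)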

Julia Silge
  • This problem with tokenizers is now resolved and tokenizing should work on all platforms, including Windows: https://github.com/ropensci/tokenizers/issues/58 I have no idea why @meczupevi's answer below was deleted; it's extremely relevant to this question. – Julia Silge Mar 21 '18 at 03:00

Digging into the source code for tidytext, it looks like the words and ngrams are split using the tokenizers package. Those functions use different methods: tokenize_words uses stri_split, whereas tokenize_ngrams uses custom C++ code.

I imagine that final step, switching between R and C++ data types, garbles the diacritics, though I can't explain precisely why.
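If you want to confirm that the two code paths really do diverge on an affected machine, calling the tokenizers functions directly bypasses tidytext entirely (a diagnostic sketch; presumably only the second call shows the mojibake on Windows):

    library(tokenizers)

    text <- "César Moreira Nuñez"

    # Goes through stringi::stri_split, which handles encodings correctly:
    tokenize_words(text)

    # Goes through the package's custom C++ code, the suspected source
    # of the Windows-only garbling:
    tokenize_ngrams(text, n = 2)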

David Klotz
  • Your assessment seems correct to me: this is a bug in `tokenizers`. The C++ source for skip_ngrams never specifies the encoding. Most likely, it defaults to the native encoding, which is UTF-8 on Linux and MacOS but Windows-1252 on Windows. – Patrick Perry Dec 11 '17 at 18:34
  • I filed a bug report at https://github.com/ropensci/tokenizers/issues/58 – Patrick Perry Dec 11 '17 at 18:43

I don't know what the problem is, but I was able to reproduce it. I can also confirm that the following works on Windows:

library(corpus)
df %>% term_counts(ngrams = 2)
#>   text term          count
#> 1 1    césar moreira     1
#> 2 1    moreira nuñez     1

The result here is much like that of unnest_tokens, but it aggregates by term and does not retain the other variables in df. To get results like unnest_tokens gives you, join the result back with df using the text column, something like:

y <- df %>% term_counts(ngrams = 2)
cbind(df[y$text,], y)
Patrick Perry