You are most likely facing an encoding issue; with properly encoded Unicode (UTF-8) text the
unnest_tokens(..., output = word) %>% anti_join(get_stopwords(language = 'pt'))
approach should behave as expected. This is not specific to anti_join():
until the text encoding issues are dealt with, you can't really do any meaningful text processing or analysis.
To illustrate, here's a reproducible example with non-UTF-8 text as input. We'll first detect the encoding, convert the text to UTF-8, split it into words and remove stopwords, checking the effect of (almost) every step:
library(dplyr)
library(readr)
library(stringi)
library(stringr)
library(tidytext)
# example text, non-unicode:
kant_txt <- read_file("http://www.filosofia.com.br/figuras/livros_inteiros/167.txt")
# detecting encoding:
stri_enc_detect(kant_txt)
#> [[1]]
#> Encoding Language Confidence
#> 1 windows-1252 pt 0.81
#> 2 windows-1250 ro 0.35
#> 3 windows-1254 tr 0.17
#> 4 UTF-16BE 0.10
#> 5 UTF-16LE 0.10
# convert to Unicode and store in tibble:
kant_utf8 <- stri_encode(kant_txt, from = "windows-1252", to = "utf8")
kant <- tibble(title = "critica_da_razao_pura", text = kant_utf8)
kant
#> # A tibble: 1 × 2
#> title text
#> <chr> <chr>
#> 1 critica_da_razao_pura "Immanuel Kant – Crítica da Razão Pura\r\n\r\nProfessor…
# split text into tokens, default unit is word and by default
# tokens are converted to lowercase:
kant_tokens <- unnest_tokens(kant, output = word, input = text)
# note dimensions, 199624 rows:
kant_tokens
#> # A tibble: 199,624 × 2
#> title word
#> <chr> <chr>
#> 1 critica_da_razao_pura immanuel
#> 2 critica_da_razao_pura kant
#> 3 critica_da_razao_pura crítica
#> 4 critica_da_razao_pura da
#> 5 critica_da_razao_pura razão
#> 6 critica_da_razao_pura pura
#> 7 critica_da_razao_pura professor
#> 8 critica_da_razao_pura em
#> 9 critica_da_razao_pura kõnigsberg
#> 10 critica_da_razao_pura membro
#> # ℹ 199,614 more rows
# count words starting with "n", top 5:
kant_tokens %>%
  filter(str_starts(word, "n")) %>%
  summarise(count = n(), .by = word) %>%
  arrange(desc(count)) %>%
  print(n = 5)
#> # A tibble: 185 × 2
#> word count
#> <chr> <int>
#> 1 não 3355
#> 2 na 1383
#> 3 no 1311
#> 4 nos 562
#> 5 natureza 466
#> # ℹ 180 more rows
# drop stopwords:
kant_nostop <- anti_join(kant_tokens, get_stopwords(language = 'pt'))
#> Joining with `by = join_by(word)`
# keep an eye on changed row count:
kant_nostop
#> # A tibble: 112,540 × 2
#> title word
#> <chr> <chr>
#> 1 critica_da_razao_pura immanuel
#> 2 critica_da_razao_pura kant
#> 3 critica_da_razao_pura crítica
#> 4 critica_da_razao_pura razão
#> 5 critica_da_razao_pura pura
#> 6 critica_da_razao_pura professor
#> 7 critica_da_razao_pura kõnigsberg
#> 8 critica_da_razao_pura membro
#> 9 critica_da_razao_pura academia
#> 10 critica_da_razao_pura real
#> # ℹ 112,530 more rows
# count words starting with "n" after stopwords are removed, top 5:
kant_nostop %>%
  filter(str_starts(word, "n")) %>%
  summarise(count = n(), .by = word) %>%
  arrange(desc(count)) %>%
  print(n = 5)
#> # A tibble: 172 × 2
#> word count
#> <chr> <int>
#> 1 natureza 466
#> 2 nada 331
#> 3 nenhuma 245
#> 4 nenhum 235
#> 5 necessidade 176
#> # ℹ 167 more rows
Created on 2023-09-01 with reprex v2.0.2
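
For a smaller picture of why the conversion step matters, here is a minimal base R sketch (no network access needed) comparing the word "não" as windows-1252 bytes with the same word in UTF-8. The variable names are made up for illustration, and it assumes the stopword list stores its entries as UTF-8; base `iconv()` stands in for `stri_encode()` only to keep the snippet dependency-free.

```r
# "não" as it would be stored in a windows-1252 file: the byte 0xE3 is "ã"
word_cp1252 <- rawToChar(as.raw(c(0x6e, 0xe3, 0x6f)))

# the same word as a UTF-8 string
# (assumption: this is how stopword lists store their entries)
stopword_utf8 <- "n\u00e3o"

# the unconverted bytes are not valid UTF-8 ...
valid_before <- validUTF8(word_cp1252)

# ... and they differ from the UTF-8 bytes of the stopword,
# so a byte-for-byte comparison of the two does not match
same_bytes_before <- identical(charToRaw(word_cp1252), charToRaw(stopword_utf8))

# after conversion the token and the stopword are the same string
word_utf8 <- iconv(word_cp1252, from = "windows-1252", to = "UTF-8")
match_after <- identical(word_utf8, stopword_utf8)

print(c(valid_before = valid_before,
        same_bytes_before = same_bytes_before,
        match_after = match_after))
```

Only the last of the three checks should come out TRUE. That is the same thing the larger example shows: once the text is converted, "não" matches the stopword list and disappears from the top-5 count.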