
I'm very new to this world of programming.

OK, so I am doing an analysis of a text in R. I am using this to get rid of stop words:

kant_palavras <- kant_palavras %>% anti_join(get_stopwords(language = 'pt'))

But afterwards, when I count the words, the most common one is "no". This is not useful for my analysis and I want to remove it, but I do not know how to do it.

I tried

kant_palavras <- kant_palavras %>% anti_join("no")

and

palavras_a_remover <- c("no") 

kant_palavras <- kant_palavras %>% anti_join(data.frame(palavra = palavras_a_remover))

and

palavras_a_remover <- c("no")

kant_palavras <- kant_palavras %>% 
  filter(!palavra %in% palavras_a_remover)

None of these worked to get rid of that "no"!

--

Full code up to this point (it all works):

dados_kant <- read.csv("kant2.csv")

dados_kant2 <- as_tibble(dados_kant)

Encoding(dados_kant2$texto.do.kant) <- "ASCII"

for (i in 1:nrow(dados_kant2))
{
  dados_kant2$texto.do.kant[i] <- iconv(dados_kant2$texto.do.kant[i], to = "ASCII//TRANSLIT")
}

kant_palavras <- dados_kant2 %>%  unnest_tokens(word, texto.do.kant)

kant_palavras <- kant_palavras %>% anti_join(get_stopwords(language = 'pt'))
  • Your attempts seem like they should have worked, which means that we'll need a reproducible example in order to debug further. What's in the kant_palavras object? It needs to be a data frame for anti_join to have any effect. Did the other stopwords get removed successfully? – Dubukay Aug 31 '23 at 22:43
  • I added the full code to the question! – philosophy Aug 31 '23 at 23:46
  • Could you share some of the data you are using (i.e. the contents of kant2.csv)? You could do this by copying the output of `dput(head(dados_kant, 50))` into your question. – Jay Bee Aug 31 '23 at 23:59
  • Of course! It's all text. "> dput(head(dados_kant, 50)) structure(list(texto.do.kant = c("INTRODUÇÃO I — Da Distinção Entre o Conhecimento Puro e o Empírico Não se pode duvidar de que todos os nossos conhecimentos começam com a experiência, porque, com efeito, como haveria de exercitar-se a faculdade de se conhecer, se não fosse pelos objetos que (...) de vista geral de um sistema, deve", "ela com\"")), row.names = 1:2, class = "data.frame") – philosophy Sep 01 '23 at 01:09
  • Might be related to some encoding issue and for those it's not so easy to provide reproducible examples. Is `kant2.csv` available for download? And would you mind sharing your R version and platform details (win/mac/linux) ? – margusl Sep 01 '23 at 06:10

4 Answers


You can do:

library(tidyverse)
kant_palavras <- kant_palavras %>%
  filter(!str_detect(texto.do.kant, '\\bno\\b'))

This would remove the entire row. If you only want to remove the word 'no', but keep the rest of the text, you can do:

kant_palavras <- kant_palavras %>%
  mutate(texto.do.kant = str_remove_all(texto.do.kant, '\\bno\\b'))
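
Note that after the question's unnest_tokens(word, texto.do.kant) call, each row holds a single token in a column named word (not palavra), so a plain filter on that column is a minimal sketch that should also work (assuming kant_palavras was built as shown in the question):

# `word` is the token column created by unnest_tokens in the question's code;
# drop rows whose token is exactly "no"
kant_palavras <- kant_palavras %>%
  filter(word != "no")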
deschen
  • Thank you for your answer! But it didn't work... the message is: > kant_palavras <- kant_palavras %>% + filter(!str_detect(palavra , '\\bno\\b')) Error in `filter()`: ℹ In argument: `!str_detect(palavra, "\\bno\\b")`. Caused by error: ! object 'palavra' not found Run `rlang::last_trace()` to see where the error occurred. – philosophy Aug 31 '23 at 23:44
  • See my update, now that you have provided some data with probably correct column names. – deschen Sep 01 '23 at 05:33

I have an adapted version of the dput data you provided as df. I added some different variations of a 'no' value ('no', 'nono', 'No') so we can see what gets removed.

df <- structure(list(texto.do.kant = c("INTRODUÇÃO I — Da Distinção Entre o Conhecimento Puro e o Empírico Não se pode duvidar no de que todos os nossos conhecimentos começam com a experiência, nono porque, com efeito, como haveria de exercitar-se a faculdade de se conhecer, se não fosse pelos objetos que (...) de vista geral de um sistema, deve", "ela No no com\"")), row.names = 1:2, class = "data.frame")

And then:

library(tidyverse)
df2 <- str_remove(df$texto.do.kant, "\\bno\\b")

Which gives:

[1] "INTRODUÇÃO I — Da Distinção Entre o Conhecimento Puro e o Empírico Não se pode duvidar  de que todos os nossos conhecimentos começam com a experiência, nono porque, com efeito, como haveria de exercitar-se a faculdade de se conhecer, se não fosse pelos objetos que (...) de vista geral de um sistema, deve"
[2] "ela No  com\"" 

'no' is removed, while 'No' and 'nono' remain.
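
If the capitalized 'No' should also go (an assumption about the desired behaviour, not something the question asks for), one option is a case-insensitive pattern via stringr's regex() modifier, together with str_remove_all() to drop every occurrence rather than just the first:

# also matches "No" thanks to ignore_case; removes every standalone match per string
df3 <- str_remove_all(df$texto.do.kant, regex("\\bno\\b", ignore_case = TRUE))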

Jay Bee

You are probably facing some encoding issue(s); with Unicode text, the

unnest_tokens(..., output = word)  %>% anti_join(get_stopwords(language = 'pt')) 

approach should behave as expected. It's not just an anti_join() issue: until the text encoding problems are dealt with, you can't really do any meaningful text processing/analysis.

To illustrate, here's a reproducible example with non-UTF-8 text as input. We'll first try to detect the encoding, convert it to UTF-8, split the text into words, and remove stopwords, while checking the effect of (almost) every step:

library(dplyr)
library(readr)
library(stringi)
library(stringr)
library(tidytext)

# example text, non-unicode:
kant_txt <- read_file("http://www.filosofia.com.br/figuras/livros_inteiros/167.txt")
# detecting encoding:
stri_enc_detect(kant_txt)
#> [[1]]
#>       Encoding Language Confidence
#> 1 windows-1252       pt       0.81
#> 2 windows-1250       ro       0.35
#> 3 windows-1254       tr       0.17
#> 4     UTF-16BE                0.10
#> 5     UTF-16LE                0.10

# convert to Unicode and store in tibble:
kant_utf8 <- stri_encode(kant_txt, from = "windows-1252", to = "utf8")
kant <- tibble(title = "critica_da_razao_pura", text = kant_utf8)
kant
#> # A tibble: 1 × 2
#>   title                 text                                                    
#>   <chr>                 <chr>                                                   
#> 1 critica_da_razao_pura "Immanuel Kant – Crítica da Razão Pura\r\n\r\nProfessor…

# split text into tokens, default unit is word and by default 
# tokens are converted to lowercase:
kant_tokens <- unnest_tokens(kant, output = word, input = text)
# note dimensions, 199624 rows:
kant_tokens
#> # A tibble: 199,624 × 2
#>    title                 word      
#>    <chr>                 <chr>     
#>  1 critica_da_razao_pura immanuel  
#>  2 critica_da_razao_pura kant      
#>  3 critica_da_razao_pura crítica   
#>  4 critica_da_razao_pura da        
#>  5 critica_da_razao_pura razão     
#>  6 critica_da_razao_pura pura      
#>  7 critica_da_razao_pura professor 
#>  8 critica_da_razao_pura em        
#>  9 critica_da_razao_pura kõnigsberg
#> 10 critica_da_razao_pura membro    
#> # ℹ 199,614 more rows

# count words starting with "n", top 5:
kant_tokens %>% 
  filter(str_starts(word, "n")) %>% 
  summarise(count = n(), .by = word) %>% 
  arrange(desc(count)) %>% 
  print(n = 5)
#> # A tibble: 185 × 2
#>   word     count
#>   <chr>    <int>
#> 1 não       3355
#> 2 na        1383
#> 3 no        1311
#> 4 nos        562
#> 5 natureza   466
#> # ℹ 180 more rows

# drop stopwords:
kant_nostop <- anti_join(kant_tokens, get_stopwords(language = 'pt'))
#> Joining with `by = join_by(word)`
# keep an eye on changed row count:
kant_nostop
#> # A tibble: 112,540 × 2
#>    title                 word      
#>    <chr>                 <chr>     
#>  1 critica_da_razao_pura immanuel  
#>  2 critica_da_razao_pura kant      
#>  3 critica_da_razao_pura crítica   
#>  4 critica_da_razao_pura razão     
#>  5 critica_da_razao_pura pura      
#>  6 critica_da_razao_pura professor 
#>  7 critica_da_razao_pura kõnigsberg
#>  8 critica_da_razao_pura membro    
#>  9 critica_da_razao_pura academia  
#> 10 critica_da_razao_pura real      
#> # ℹ 112,530 more rows

# count words starting with "n" after stopwords are removed, top 5:
kant_nostop %>% 
  filter(str_starts(word, "n")) %>% 
  summarise(count = n(), .by = word) %>% 
  arrange(desc(count)) %>% 
  print(n = 5)

#> # A tibble: 172 × 2
#>   word        count
#>   <chr>       <int>
#> 1 natureza      466
#> 2 nada          331
#> 3 nenhuma       245
#> 4 nenhum        235
#> 5 necessidade   176
#> # ℹ 167 more rows

Created on 2023-09-01 with reprex v2.0.2

margusl

Let's start with this example, where you count up the words in Pride and Prejudice after removing stopwords:

library(tidyverse)
library(tidytext)

tibble(txt = janeaustenr::prideprejudice) |> 
  unnest_tokens(word, txt) |> 
  anti_join(get_stopwords()) |> 
  count(word, sort = TRUE)
#> Joining with `by = join_by(word)`
#> # A tibble: 6,404 × 2
#>    word          n
#>    <chr>     <int>
#>  1 mr          785
#>  2 elizabeth   597
#>  3 said        401
#>  4 darcy       373
#>  5 mrs         343
#>  6 much        326
#>  7 must        305
#>  8 bennet      294
#>  9 miss        283
#> 10 jane        264
#> # ℹ 6,394 more rows

Created on 2023-09-01 with reprex v2.0.2

But let's say you don't want to include those words "mr", "mrs", and "miss". If the list is short, I think I would use filter():

library(tidyverse)
library(tidytext)

tibble(txt = janeaustenr::prideprejudice) |> 
  unnest_tokens(word, txt) |> 
  anti_join(get_stopwords()) |>
  filter(!word %in% c("mr", "mrs", "miss")) |> 
  count(word, sort = TRUE)
#> Joining with `by = join_by(word)`
#> # A tibble: 6,401 × 2
#>    word          n
#>    <chr>     <int>
#>  1 elizabeth   597
#>  2 said        401
#>  3 darcy       373
#>  4 much        326
#>  5 must        305
#>  6 bennet      294
#>  7 jane        264
#>  8 one         263
#>  9 bingley     257
#> 10 know        236
#> # ℹ 6,391 more rows

Created on 2023-09-01 with reprex v2.0.2

You could also add them to a stopword lexicon, like this:

library(tidyverse)
library(tidytext)

my_custom_stopwords <-
  get_stopwords() |> 
  bind_rows(
    tibble(
      word = c("mr", "mrs", "miss"),
      lexicon = "custom"
    )
  )

tail(my_custom_stopwords)
#> # A tibble: 6 × 2
#>   word  lexicon 
#>   <chr> <chr>   
#> 1 too   snowball
#> 2 very  snowball
#> 3 will  snowball
#> 4 mr    custom  
#> 5 mrs   custom  
#> 6 miss  custom

Created on 2023-09-01 with reprex v2.0.2
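
To actually use that custom table, you would substitute it for get_stopwords() in the pipeline above; a minimal sketch, reusing the Pride and Prejudice example:

tibble(txt = janeaustenr::prideprejudice) |> 
  unnest_tokens(word, txt) |> 
  # anti_join now drops the snowball stopwords plus "mr", "mrs", and "miss"
  anti_join(my_custom_stopwords) |> 
  count(word, sort = TRUE)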

Julia Silge