The stopwords are working just fine; the default Snowball list of French stopwords simply does not include the words you wish to remove.
You can see that by inspecting the vector of stopwords returned by stopwords("fr"):
library("quanteda")
## Package version: 2.1.2
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr")
## [1] FALSE FALSE FALSE FALSE FALSE
This is the full list of words:
sort(stopwords("fr"))
## [1] "à" "ai" "aie" "aient" "aies" "ait"
## [7] "as" "au" "aura" "aurai" "auraient" "aurais"
## [13] "aurait" "auras" "aurez" "auriez" "aurions" "aurons"
## [19] "auront" "aux" "avaient" "avais" "avait" "avec"
## [25] "avez" "aviez" "avions" "avons" "ayant" "ayez"
## [31] "ayons" "c" "ce" "ceci" "cela" "celà"
## [37] "ces" "cet" "cette" "d" "dans" "de"
## [43] "des" "du" "elle" "en" "es" "est"
## [49] "et" "étaient" "étais" "était" "étant" "été"
## [55] "étée" "étées" "étés" "êtes" "étiez" "étions"
## [61] "eu" "eue" "eues" "eûmes" "eurent" "eus"
## [67] "eusse" "eussent" "eusses" "eussiez" "eussions" "eut"
## [73] "eût" "eûtes" "eux" "fûmes" "furent" "fus"
## [79] "fusse" "fussent" "fusses" "fussiez" "fussions" "fut"
## [85] "fût" "fûtes" "ici" "il" "ils" "j"
## [91] "je" "l" "la" "le" "les" "leur"
## [97] "leurs" "lui" "m" "ma" "mais" "me"
## [103] "même" "mes" "moi" "mon" "n" "ne"
## [109] "nos" "notre" "nous" "on" "ont" "ou"
## [115] "par" "pas" "pour" "qu" "que" "quel"
## [121] "quelle" "quelles" "quels" "qui" "s" "sa"
## [127] "sans" "se" "sera" "serai" "seraient" "serais"
## [133] "serait" "seras" "serez" "seriez" "serions" "serons"
## [139] "seront" "ses" "soi" "soient" "sois" "soit"
## [145] "sommes" "son" "sont" "soyez" "soyons" "suis"
## [151] "sur" "t" "ta" "te" "tes" "toi"
## [157] "ton" "tu" "un" "une" "vos" "votre"
## [163] "vous" "y"
That's why they are not removed. We can see this with an example using many of your words:
toks <- tokens("Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
remove_punct = TRUE
)
tokens_remove(toks, stopwords("fr"))
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "avoir" "glace" "être" "heureux" "comme" "enfant"
## [8] "avant" "dîner"
How can you remove them? Either use a more complete list of stopwords, or customize the Snowball list by appending the stopwords you want to the existing ones.
mystopwords <- c(stopwords("fr"), "comme", "avoir", "plus", "avant", "être")
tokens_remove(toks, mystopwords)
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "glace" "heureux" "enfant" "dîner"
You could also use one of the other stopword sources, such as "stopwords-iso", which does contain all of the words you wish to remove:
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr", source = "stopwords-iso")
## [1] TRUE TRUE TRUE TRUE TRUE
With regard to the language question, see the help for ?stopwords::stopwords, which states:
The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English names of the stopwords from the quanteda package may also be used, although these are deprecated.
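For instance, the ISO code "fr" and the full English name "french" are expected to select the same Snowball list, although the full name is the deprecated form (this is a quick check, not something you should rely on in new code):

```r
library("stopwords")

# "fr" is the ISO 639-1 code; "french" is kept only for backwards
# compatibility with older quanteda usage and is deprecated.
identical(stopwords("fr"), stopwords("french"))
```
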
With regard to what you tried with stringi::stri_trans_general(x, "Latin-ASCII"), this would only help if you wanted to remove "etre" but your stopword list contained only "être". In the example below, the stopword vector containing the accented form is concatenated with a version of itself in which the accents have been removed.
sw <- "être"
tokens("etre être heureux") %>%
tokens_remove(sw)
## Tokens consisting of 1 document.
## text1 :
## [1] "etre" "heureux"
tokens("etre être heureux") %>%
tokens_remove(c(sw, stringi::stri_trans_general(sw, "Latin-ASCII")))
## Tokens consisting of 1 document.
## text1 :
## [1] "heureux"
c(sw, stringi::stri_trans_general(sw, "Latin-ASCII"))
## [1] "être" "etre"