How to remove single and double char tokens using quanteda::tokens_select()

Question

I am trying to remove single and double char tokens.

Here is an example:

toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)

toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection = "remove")

toks

Results:

tokens from 1 document. text1 :

[1] "is" "a" "is" "a"

I expected to get the tokens that do not meet the criteria, not the ones that do.

score 5 · Answer 1 · answered Feb 09 '19 at 17:36


library(quanteda)

toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)
tokens_select(toks, min_nchar=3L)
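As a rough check of what that call keeps, here is a minimal sketch. It assumes a quanteda version in which `tokens_select()` accepts `min_nchar` without a `pattern`, as in the snippet above; the name `long_toks` is only illustrative.

```r
library(quanteda)

toks <- tokens("This is a sentence. This is a second sentence.", remove_punct = TRUE)

# keep only tokens that are at least 3 characters long
long_toks <- tokens_select(toks, min_nchar = 3L)

# flatten to a plain character vector to inspect the surviving tokens
as.character(long_toks)
```

With the example sentence, this should leave only the tokens of three or more characters, so "is" and "a" are dropped.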

Kipras Kančys



These also work: `tokens_remove(toks, c("?", "??"))` `tokens_keep(toks, min_nchar = 3)` – Ken Benoit Feb 10 '19 at 00:39
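The two alternatives in the comment can be compared side by side. This is a sketch, assuming the default `valuetype = "glob"`, under which `?` matches exactly one character; `by_glob` and `by_length` are illustrative names, not part of quanteda.

```r
library(quanteda)

toks <- tokens("This is a sentence. This is a second sentence.", remove_punct = TRUE)

# glob patterns: "?" matches any one-character token, "??" any two-character token
by_glob <- tokens_remove(toks, c("?", "??"))

# length filter: keep tokens with at least 3 characters
by_length <- tokens_keep(toks, min_nchar = 3)

# compare the surviving tokens as plain character vectors
identical(as.character(by_glob), as.character(by_length))
```

The glob version states what to drop, while the `min_nchar` version states what to keep; for this example they should leave the same tokens.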

score 1 · Accepted Answer · answered Feb 09 '19 at 17:39


It looks like the selection argument is ignored when only min_nchar and max_nchar are supplied: tokens inside the length range are kept either way.

This gives the results I wanted.

toks <- tokens_select(toks, min_nchar=3L, max_nchar=79L)

ronencozen


YOLO · Answer 3 · 2019-02-09T17:33:57.997

-1

You need to convert the given sentence into tokens. You can do the following:

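library(quanteda)

sent <- "This is a sentence. This is a second sentence."

# convert to tokens, then flatten to a plain character vector
toks <- unlist(tokens(sent, remove_punct = TRUE), use.names = FALSE)

# keep only tokens with more than 2 characters
Filter(function(x) nchar(x) > 2, toks)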

[1] "This"     "sentence" "This"     "second"   "sentence"

edited Feb 09 '19 at 17:33

answered Feb 09 '19 at 17:16

YOLO

