I am trying to remove one- and two-character tokens.

Here is an example:

toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)

toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection = "remove")

toks

Results:

tokens from 1 document. text1 :

[1] "is" "a" "is" "a"

I expected to get the tokens that do not meet the criteria (everything except the one- and two-character tokens), not the ones that do.

ronencozen

3 Answers

library(quanteda)

toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)
# keep only tokens with at least 3 characters (selection defaults to "keep")
tokens_select(toks, min_nchar=3L)
Kipras Kančys

It looks like the selection argument is ignored when filtering by min_nchar and max_nchar.

Setting the bounds to the token lengths I want to keep gives the result I wanted:

toks <- tokens_select(toks, min_nchar=3L, max_nchar=79L)
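The workaround above relies on the fact that removing tokens of length 1 to 2 is the same as keeping tokens of length 3 or more. Here is a minimal base R sketch of that equivalence. It assumes a plain character vector `words` as a hypothetical stand-in for the tokens object, so it only illustrates the filtering logic and says nothing about how quanteda implements it:

```r
# Hypothetical stand-in for the tokenised text (punctuation already removed).
words <- c("This", "is", "a", "sentence", "This", "is", "a", "second", "sentence")

# Number of characters in each token.
len <- nchar(words)

# "Remove tokens of length 1 to 2" ...
removed_short <- words[!(len >= 1 & len <= 2)]

# ... is the same as "keep tokens of length 3 or more".
kept_long <- words[len >= 3]

identical(removed_short, kept_long)  # TRUE
```

Expressing the filter as "keep" bounds avoids depending on how the selection argument interacts with the length limits.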

ronencozen

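You can also convert the sentence into a plain character vector of tokens and filter it with base R. Note that the result is a character vector rather than a tokens object:

library(quanteda)

# example input (the sentence from the question)
sent <- "This is a sentence. This is a second sentence."

# convert to a plain character vector of tokens
toks_chr <- unlist(tokens(sent, remove_punct = TRUE), use.names = FALSE)

# keep only tokens with more than 2 characters
Filter(function(x) nchar(x) > 2, toks_chr)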

[1] "This"     "sentence" "This"     "second"   "sentence"
YOLO