
In the text2vec package, I am using the create_vocabulary function. For example, my text is "This book is very good", and suppose I am not using stopwords and I use ngrams from 1L to 3L. The vocabulary terms will then be:

This, book, is, very, good, This book, ..., book is very, very good. I just want to remove the term "book is very" (and a host of other terms, supplied as a vector). Since I want to remove a phrase rather than a single word, I can't use stopwords. I have written the code below:

x <- read.csv('Filename') # file containing all the stop phrases
stp <- as.vector(x$term)

vocab <- create_vocabulary(it, ngram = c(1L, 3L))
vocab_mod <- subset(vocab, !(term %in% stp)) # where stp is the vector of stop phrases

When I do the above step, the meta-information stored in the attributes of vocab gets lost in vocab_mod, so it can't be used in create_dtm.

Sotos
tej kiran

2 Answers

It seems that the subset function drops some attributes. You can try:

library(text2vec)
txt = "This book is very good"
it = itoken(txt)
v = create_vocabulary(it, ngram = c(1, 3))
v = v[!(v$term %in% "is_very_good"), ]    
v
# Number of docs: 1 
# 0 stopwords:  ... 
# ngram_min = 1; ngram_max = 3 
# Vocabulary: 
#             term term_count doc_count
#  1:         good          1         1
#  2: book_is_very          1         1
#  3:    This_book          1         1
#  4:         This          1         1
#  5:         book          1         1
#  6:    very_good          1         1
#  7:      is_very          1         1
#  8:      book_is          1         1
#  9: This_book_is          1         1
# 10:           is          1         1
# 11:         very          1         1
dtm = create_dtm(it, vocab_vectorizer(v))
Dmitriy Selivanov

@Dmitriy, even this drops the attributes. The workaround I found for now is to add the attributes back manually using the attr function:

attr(vocab_mod, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)

and so on for the other attributes as well. The attribute details can be read from vocab.
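Rather than setting each attribute by hand, the same workaround could be wrapped in a small helper that copies every attribute the filtered object no longer has. This is only a sketch in base R: restore_attrs is a hypothetical name and not a text2vec function, and the data.frame below is a stand-in for a real vocabulary, so which attributes text2vec actually needs is an assumption to check against your installed version. Note that the stop phrases have to be written the way the vocabulary prints them, joined with underscores ("book_is_very").

```r
# Hypothetical helper (not part of text2vec): copy over any attributes that
# the filtered object lost, leaving the ones it still has untouched.
restore_attrs <- function(filtered, original) {
  missing <- setdiff(names(attributes(original)), names(attributes(filtered)))
  for (nm in missing) {
    attr(filtered, nm) <- attr(original, nm)
  }
  filtered
}

# Stand-in for a vocabulary: a plain data.frame carrying an "ngram" attribute.
vocab <- data.frame(term = c("book", "book_is_very", "very_good"),
                    term_count = c(1L, 1L, 1L),
                    stringsAsFactors = FALSE)
attr(vocab, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)

# Stop phrases use the underscore-joined form of the ngram terms.
stp <- c("book_is_very")

# Filter the rows, then put back whatever attributes went missing.
vocab_mod <- vocab[!(vocab$term %in% stp), ]
vocab_mod <- restore_attrs(vocab_mod, vocab)
```

With a real vocabulary you would call restore_attrs(vocab_mod, vocab) right after the filtering step and before create_dtm.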

tej kiran