I have an N×M matrix tf.m and a data frame df with N rows.
I want to assign row n of the matrix to a single column of the data frame, in the same row n.

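library("tm")
# Build a TF-IDF-weighted DocumentTermMatrix from a character vector
ftfidf <- function(text.d) {
  txt <- VectorSource(text.d)
  txt.corpus <- VCorpus(txt, readerControl = list(reader = readPlain, language = "en"))
  # lower-case every document before tokenizing
  revs <- tm_map(txt.corpus, content_transformer(tolower))
  # return the matrix explicitly instead of relying on the value of an assignment
  DocumentTermMatrix(revs, control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE), stopwords = TRUE))
}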

df <- data.frame(id = c("doc1", "doc2", "doc3"), text = c("hello world", "people people", "happy people"), stringsAsFactors = FALSE)
#id          text
#1 doc1   hello world
#2 doc2 people people
#3 doc3  happy people
tf <- ftfidf(df$text) # a function that gets a DocumentTermMatrix
tf.m <- as.matrix(tf)
#Terms
#Docs     happy     hello    people     world
#1 0.0000000 0.7924813 0.0000000 0.7924813
#2 0.0000000 0.0000000 0.5849625 0.0000000
#3 0.7924813 0.0000000 0.2924813 0.0000000

If I run this, I get four more columns in the data frame:

df$tf<-tf.m
#id          text  tf.happy  tf.hello tf.people  tf.world
#1 doc1   hello world 0.0000000 0.7924813 0.0000000 0.7924813
#2 doc2 people people 0.0000000 0.0000000 0.5849625 0.0000000
#3 doc3  happy people 0.7924813 0.0000000 0.2924813 0.0000000
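
As a side note, a minimal base R sketch (using the values printed above instead of the tm pipeline) suggests that `df$tf <- tf.m` stores the whole matrix as one matrix-valued column, and that only the printed output expands it into `tf.happy`, `tf.hello`, and so on. The `tf_list` column is a hypothetical alternative, not something from the original code, that keeps one named vector per row:

```r
# Toy data copied from the printed output; the tm step is skipped here.
df <- data.frame(id = c("doc1", "doc2", "doc3"),
                 text = c("hello world", "people people", "happy people"),
                 stringsAsFactors = FALSE)
tf.m <- matrix(
  c(0,         0.7924813, 0,         0.7924813,
    0,         0,         0.5849625, 0,
    0.7924813, 0,         0.2924813, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("happy", "hello", "people", "world"))
)

df$tf <- tf.m

n_cols <- ncol(df)   # 3: id, text, tf (tf is a single matrix column)
is.matrix(df$tf)     # TRUE
row1 <- df$tf[1, ]   # named numeric vector: row 1 of the matrix

# Hypothetical alternative: a list column holding one named vector per row.
df$tf_list <- lapply(seq_len(nrow(tf.m)), function(i) tf.m[i, ])
df$tf_list[[2]]      # term weights for doc2
```

Whether a matrix column or a list column is more convenient depends on what the downstream function accepts; many modeling functions expect a plain numeric matrix, in which case `tf.m` can be passed directly.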

I would like to have this:

#id          text       tf
#1 doc1   hello world   happy     hello    people     world
#                       0.0000000 0.7924813 0.0000000 0.7924813
#2 doc2 people people   happy     hello    people     world
#                       0.0000000 0.0000000 0.5849625 0.0000000
#3 doc3  happy people   happy     hello    people     world
#                       0.7924813 0.0000000 0.2924813 0.0000000

so that I can try to train a kNN model on the term frequencies in df$tf (if possible):

 knn_model <- knn(train = df$tf[1,], cl = df$id, k=3)

and then query for the nearest neighbors of a given df$id.
My goal is to do something in R like this Python GraphLab call:

knn_model = graphlab.nearest_neighbors.create(df,features=['tf'],label='id')
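
One way such a nearest-neighbor query might be sketched in base R, without GraphLab or an extra package, is to compute pairwise distances with `dist()` and sort them. The matrix values are copied from the output above, `nearest_docs` is a hypothetical helper, and Euclidean distance is an assumption (GraphLab may choose a different metric):

```r
# Toy TF-IDF matrix from the question (rows = documents, columns = terms).
tf.m <- matrix(
  c(0,         0.7924813, 0,         0.7924813,
    0,         0,         0.5849625, 0,
    0.7924813, 0,         0.2924813, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("happy", "hello", "people", "world"))
)

# Pairwise Euclidean distances between documents, as a square matrix.
d <- as.matrix(dist(tf.m))

# Hypothetical helper: the k documents closest to `id`, excluding `id` itself.
nearest_docs <- function(d, id, k = 1) {
  row <- d[id, ]
  row <- row[names(row) != id]  # drop the zero distance to itself
  head(sort(row), k)
}

nearest_docs(d, "doc2", k = 2)  # doc3 comes first: it shares the term "people"
```

This brute-force approach computes every pairwise distance, so it only suits small corpora; for larger ones a dedicated nearest-neighbor package would be the more realistic choice.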
Stefano Piovesan

1 Answer

It looks like you want hierarchical indices. To my knowledge, there is no clear-cut way to do this in R. data.table allows you to assign keys, but these are not true indices because they are part of the data, in contrast to Python's pandas, where the metadata (index) and the data are decoupled. I infer this from the expression df$tf[1,], which indexes tf by row and column and would raise an "incorrect number of dimensions" error if tf were an ordinary vector column of a data.frame.

In my experience, R usually expects data like this to be in long format, i.e.:

id   text          tf    value
doc1 hello world  happy  0.0000000
doc1 hello world  hello  0.7924813
doc1 hello world  people 0.0000000
doc1 hello world  world  0.7924813

This can be achieved with the melt functions found in various packages (for example reshape2 or data.table). Sometimes you need exactly one variable column and one value column; in that case the interaction function is helpful for composing the variable.
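
As a rough illustration of the long format, here is a minimal base R sketch that uses `as.data.frame(as.table(...))` as a stand-in for a package-provided melt function. The toy values are copied from the question, and the column names `tf`, `value`, and `variable` are choices made for this example:

```r
df <- data.frame(id = c("doc1", "doc2", "doc3"),
                 text = c("hello world", "people people", "happy people"),
                 stringsAsFactors = FALSE)
tf.m <- matrix(
  c(0,         0.7924813, 0,         0.7924813,
    0,         0,         0.5849625, 0,
    0.7924813, 0,         0.2924813, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(df$id, c("happy", "hello", "people", "world"))
)

# One row per (document, term) pair.
long <- as.data.frame(as.table(tf.m), responseName = "value",
                      stringsAsFactors = FALSE)
names(long) <- c("id", "tf", "value")

# Attach the original text and sort for readability.
long <- merge(df, long, by = "id")
long <- long[order(long$id, long$tf), ]

# Compose a single variable column from id and term, e.g. "doc1.happy".
long$variable <- interaction(long$id, long$tf)

head(long, 4)
```

The first four rows should correspond to the doc1 rows shown in the table above.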

I hope this helps and that I understood your question; I am eager to find out myself whether true indices exist in R.

Love Tätting