
I have a document-feature matrix (DFM). I want to convert it into an LSA object and then compute the cosine similarity between all pairs of documents.

These are the steps I followed:

lsa_t2 <- convert(DFM_tfidf, to = "lsa" , omit_empty = TRUE)
t2_lsa_tfidf_cos_sim = sim2(x = lsa_t2, method = "cosine", norm = "l2")

but I get this error:

Error in sim2(x = lsa_t2, method = "cosine", norm = "l2") :
inherits(x, "matrix") || inherits(x, "Matrix") is not TRUE

To give more context, this is what lsa_t2 looks like:

[image: what lsa_t2 looks like]

All of the documents contain text (I have already checked), and I filtered out documents without text before I created the dfm.

Any idea of what happened?

Carbo
    Please create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with one of the datasets available in quanteda. Also mention where the function `sim2` comes from. – phiver Feb 18 '20 at 15:15

1 Answer

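The error you receive means that sim2 does not work with the lsa object: as the failed check in the message shows, sim2 only accepts an object that inherits from "matrix" or "Matrix", and the textmatrix object returned by convert() is neither. However, I'm not sure I understand the question. Why do you want to convert the dfm to the lsa textmatrix format in the first place?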
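To make the failed check concrete, here is a minimal base R sketch. It assumes that a textmatrix is an ordinary numeric matrix whose class attribute has been set to "textmatrix"; the toy matrix and the names m, tm and plain are made up for illustration and are not taken from your data.

```r
# Hypothetical stand-in for the converted object: a plain matrix
# with its class attribute overwritten (assumption, see above).
m <- matrix(c(1, 0, 2, 1), nrow = 2,
            dimnames = list(c("gold", "silver"), c("d1", "d2")))
tm <- m
class(tm) <- "textmatrix"

# This is the condition from the error message; it fails for tm.
passes_before <- inherits(tm, "matrix") || inherits(tm, "Matrix")

# Dropping the class attribute leaves an ordinary matrix again.
plain <- unclass(tm)
passes_after <- inherits(plain, "matrix") || inherits(plain, "Matrix")

print(c(before = passes_before, after = passes_after))
```

If you go this route, keep in mind that, as far as I know, the lsa format stores terms in rows and documents in columns, so you would probably also need t() before computing document similarities.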

If you want to calculate cosine similarity between texts, you can do this directly in quanteda:

library(quanteda)

texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck" )

# tokenize first: recent quanteda versions no longer accept a character vector in dfm()
texts_dfm <- dfm(tokens(texts))

textstat_simil(texts_dfm, 
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>       d1    d2    d3
#> d1 1.000 0.359 0.714
#> d2 0.359 1.000 0.598
#> d3 0.714 0.598 1.000

If you want to use sim2 from text2vec, you can do so using the same object without converting it first:

library(text2vec)
sim2(x = texts_dfm, method = "cosine", norm = "l2")
#> 3 x 3 sparse Matrix of class "dsCMatrix"
#>           d1        d2        d3
#> d1 1.0000000 0.3585686 0.7142857
#> d2 0.3585686 1.0000000 0.5976143
#> d3 0.7142857 0.5976143 1.0000000

As you can see, the results are the same.
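If it helps to see what both functions compute, the following base R sketch rebuilds the count matrix by hand and derives the cosine similarities from L2-normalised rows. The whitespace tokenisation and the helper names (tokens_list, vocab, counts, unit_rows, cos_sim) are my own simplifications, not what quanteda or text2vec do internally.

```r
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")

# Naive tokenisation: lowercase and split on single spaces.
tokens_list <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens_list)))

# Documents in rows, term counts in columns.
counts <- t(vapply(tokens_list,
                   function(tok) as.integer(table(factor(tok, levels = vocab))),
                   integer(length(vocab))))
colnames(counts) <- vocab

# L2-normalise each row; the cross product of unit vectors is the cosine.
unit_rows <- counts / sqrt(rowSums(counts^2))
cos_sim <- tcrossprod(unit_rows)
round(cos_sim, 3)
```

For example, d1 and d2 share three terms (of, in, a), and their vector lengths are sqrt(7) and sqrt(10), so the cosine is 3 / sqrt(70), which is about 0.359.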

Update

Based on the comments, I now understand that you want to transform your data with latent semantic analysis (LSA) first. You can follow the tutorial linked in the comments below and plug in the dfm instead of the dtm that is used in the tutorial:

texts_dfm_tfidf <- dfm_tfidf(texts_dfm)


library(text2vec)
lsa = LSA$new(n_topics = 2)
dtm_tfidf_lsa = fit_transform(texts_dfm_tfidf, lsa) # I get a warning here, probably due to the size of the toy dfm
d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim
#>              d1           d2        d3           d4
#> d1  1.000000000 -0.002533794 0.5452992  0.999996189
#> d2 -0.002533794  1.000000000 0.8368571 -0.005294431
#> d3  0.545299245  0.836857086 1.0000000  0.542983071
#> d4  0.999996189 -0.005294431 0.5429831  1.000000000

Note that these results differ from run to run unless you use set.seed().

Or if you want to do everything in quanteda:

texts_lsa <- textmodel_lsa(texts_dfm_tfidf, 2)

textstat_simil(as.dfm(texts_lsa$docs), 
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>          d1       d2    d3       d4
#> d1  1.00000 -0.00684 0.648  1.00000
#> d2 -0.00684  1.00000 0.757 -0.00894
#> d3  0.64799  0.75720 1.000  0.64638
#> d4  1.00000 -0.00894 0.646  1.00000
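For intuition about what the LSA step does before the similarity is computed, here is a rough base R sketch using a truncated SVD. The toy matrix x, the choice k = 2, and the scaling of the document coordinates by the singular values are assumptions made for illustration; I have not checked that text2vec or quanteda scale the coordinates the same way.

```r
# Hypothetical toy document-term matrix (documents in rows).
x <- matrix(c(1, 1, 0, 0,
              0, 1, 2, 1,
              1, 1, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("gold", "of", "silver", "truck")))

k <- 2          # number of latent dimensions to keep
dec <- svd(x)   # x = U D V'

# Document coordinates in the k-dimensional latent space (assumed scaling: U_k D_k).
doc_coords <- dec$u[, seq_len(k), drop = FALSE] %*% diag(dec$d[seq_len(k)], nrow = k)
rownames(doc_coords) <- rownames(x)

# Cosine similarity between the reduced document vectors.
unit <- doc_coords / sqrt(rowSums(doc_coords^2))
lsa_cos <- tcrossprod(unit)
round(lsa_cos, 3)
```

Because each document is now described by only k numbers, similarities can be pulled towards 1 or -1 compared with the full tf-idf space, which is consistent with the near-identical pairs in the output above.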
JBGruber
  • The thing is that I wanted to try an LSA representation of my documents before applying cosine similarity, so that I can compare it with the results I obtained using cosine similarity on my dfm (with tf-idf). – Carbo Feb 18 '20 at 15:49
  • Correct me if I'm wrong, but the `textmatrix` object you get using `convert(DFM_tfidf, to = "lsa")` is just a different format than `dfm`. Both are basically document-term matrices that have the same content and dimensions and differ only technically. The transformation therefore seems pointless to me. Do you maybe have a different representation of your data in mind? – JBGruber Feb 18 '20 at 16:02
  • http://text2vec.org/similarity.html#practical_examples I was following the example at this link using my own data (I am referring to the paragraph "Cosine similarity with LSA"). The author suggests that tf-idf might not be good enough and tries using LSA. – Carbo Feb 18 '20 at 16:11