Questions tagged [latent-semantic-indexing]

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A claimed feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.
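
As a rough illustration of the description above, the following sketch applies a truncated SVD to a tiny term-document count matrix with NumPy. The toy vocabulary, the four documents, and the choice of k = 2 are assumptions made up for this example; real LSI pipelines usually weight the matrix first (for example with TF-IDF). Here "car" and "automobile" never share a document, but both occur alongside "engine".

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep (an arbitrary choice here)
term_vecs = U[:, :k] * s[:k]      # one k-dimensional row per term
doc_vecs = Vt[:k, :].T * s[:k]    # one k-dimensional row per document

raw_car_auto = cosine(A[0], A[1])                  # similarity in the raw counts
lsi_car_auto = cosine(term_vecs[0], term_vecs[1])  # similarity in the latent space
lsi_car_flower = cosine(term_vecs[0], term_vecs[3])

print(raw_car_auto, lsi_car_auto, lsi_car_flower)
```

In the raw matrix the two terms look unrelated, while in the reduced space they end up close together because they share the context term "engine"; "car" and "flower" stay far apart.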

52 questions
3 votes, 0 answers

Interpretation of SVD for text mining topic analysis

Background: I'm learning about text mining by building my own text mining toolkit from scratch - the best way to learn! SVD: The Singular Value Decomposition is often cited as a good way to: visualise high dimensional data (word-document matrix) in…
3 votes, 1 answer

scikit-learn - Should I fit model with TF or TF-IDF?

I am trying to find out the best way to fit different probabilistic models (like Latent Dirichlet Allocation, Non-negative Matrix Factorization, etc.) in sklearn (Python). Looking at the example in the sklearn documentation, I was wondering why the…
3 votes, 0 answers

categorize websites - open source LSI?

I'm looking to categorize lots of websites (millions). I can use Nutch to crawl them and get the content of the sites, but I am looking for the best (and cheapest or free) tool to categorize them. One option is to create regular expressions that look…
Joelio (4,621 reputation)
3 votes, 1 answer

LSA - Feature selection

I have this SVD decomposition of the document. I've read this page, but I don't understand how I can compute the best feature for document separation. I know that: S x Vt gives me the relation between documents and features; U x S gives me the relation…
2 votes, 0 answers

Unable to run gensim's Distributed LSI

Problem description: Unable to run gensim's Distributed LSI due to this error: failed to initialize distributed LSI (Failed to locate the nameserver). Steps/code/corpus to reproduce: from gensim.corpora import Dictionary from gensim.models import TfidfModel,…
Naga Budigam (689 reputation)
2 votes, 1 answer

Sklearn TruncatedSVD is not returning n_components

I am fitting an LSA model on a TF-IDF matrix. My original matrix has shape (20, 22096); then I'm applying TruncatedSVD to perform the LSI/reduction: svd = TruncatedSVD(n_components=200, random_state=42, n_iter=10) svdProfile =…
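
One possible explanation, offered as an assumption because the excerpt is truncated: a matrix with 20 rows has rank at most 20, so an SVD cannot yield more than 20 meaningful components, whatever value is requested. The NumPy sketch below uses a random stand-in matrix of the same shape (not the asker's data, and not scikit-learn itself) to show that bound.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((20, 22096))  # stand-in for a 20-document TF-IDF matrix

# Thin SVD: the number of singular values equals min(n_samples, n_features).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

requested = 200                      # what the excerpt asks for
available = min(X.shape)             # upper bound imposed by the matrix shape
n_kept = min(requested, available)

# Document coordinates in the reduced space: at most `available` columns.
X_reduced = U[:, :n_kept] * s[:n_kept]
print(X_reduced.shape)
```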
2 votes, 2 answers

Topic Modelling: LDA, word frequency in each topic and Wordcloud

Question: How can I compute and code the frequency of words in each topic? My goal is to create a 'Word Cloud' from each topic. P.S. I have no problem with wordcloud. From the code, burnin <- 4000 #We do not collect this. iter <- 4000 thin…
2 votes, 1 answer

"pre-built" matrices for latent semantic analysis

I want to use Latent Semantic Analysis for a small app I'm building, but I don't want to build up the matrices myself. (Partly because the documents I have wouldn't make a very good training collection, because they're kinda short and heterogeneous,…
grautur (29,955 reputation)
2 votes, 1 answer

Trying to make sense of Latent Semantic Indexing (LSI)

I am in the process of learning Singular Value Decomposition and what I can use it for. The book that I am reading mentioned that SVD is used in Latent Semantic Indexing. I read a few articles about LSI and it seems like LSI is…
Saik (993 reputation)
2 votes, 1 answer

gensim Generating LSI model causes "Python has stopped working"

So I am trying to use gensim to generate an LSI model along with corpus_lsi following this tutorial. I start with a corpus and a dictionary that I generated myself. The list of documents is too small (9 lines = 9 documents), which is the sample…
2 votes, 1 answer

Gensim: ValueError: failed to create intent(cache|hide)|optional array-- must have defined dimensions but got (0,)

I am trying to emulate streaming for some documents and update the LSI model as additional documents are streamed in. I get this error: Traceback (most recent call last): File "gensimStreamGen_tutorial5.py", line 57, in for vector in…
otayeby (312 reputation)
2 votes, 2 answers

Problem with LSI

I am using latent semantic analysis for text similarity. I have 2 questions. How do I select the K value for dimension reduction? I have read in many places that LSI works for words with similar meanings, for example car and automobile. How is that possible? What…
user238384 (2,396 reputation)
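
On the first question above (selecting K), there is no single agreed rule. One common heuristic, sketched here under stated assumptions, keeps the smallest number of singular values whose squared sum reaches a chosen fraction of the total. The helper name `choose_k`, the 0.90 threshold, and the sample singular values are hypothetical; in practice K is often tuned against the downstream task instead.

```python
import numpy as np

def choose_k(singular_values, energy=0.90):
    """Smallest k whose top-k squared singular values reach `energy` of the total."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sq) / sq.sum()
    # First index where the cumulative share reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(sq))  # guard against rounding pushing k past the end

# Made-up singular values in descending order, as an SVD would return them.
s = np.array([2.0, 1.0, 1.0, 0.5])
k = choose_k(s)
print(k)
```

With these sample values the cumulative shares are 0.64, 0.80, 0.96 and 1.0, so the 0.90 threshold is first reached at the third singular value.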
2 votes, 2 answers

User profiling for topic-based recommender system

I'm trying to come up with a topic-based recommender system to suggest relevant text documents to users. I trained a latent semantic indexing model, using gensim, on the Wikipedia corpus. This lets me easily transform documents into the LSI topic…
1 vote, 1 answer

Common Lisp implementation of Latent Semantic Indexing

Is there a free Common Lisp implementation of Latent Semantic Indexing available? I would like to integrate that capability into an existing Lisp system.
1 vote, 4 answers

How does LDA give consistent results?

The popular topic model Latent Dirichlet Allocation (LDA), when used to extract topics from a corpus, returns different topics with different probability distributions over the dictionary words, whereas Latent Semantic Indexing (LSI) gives…
Kai (953 reputation)