
This is my first attempt at Natural Language Processing, so I started with Latent Semantic Analysis and used this tutorial to build the algorithm. After testing it, I see that it only picks out the terms of the first concept and then repeats the same terms over and over in the other ones.

I tried feeding it the documents found HERE too, and it does exactly the same thing: the values of the same topic are repeated several times in the other ones.

Could anyone help explain what is happening? I've been searching all over, and everything seems to match the tutorials exactly.

    #Imports used below (assuming tfdv is sklearn's TfidfVectorizer imported under that alias).
    from sklearn.feature_extraction.text import TfidfVectorizer as tfdv
    from sklearn.decomposition import TruncatedSVD

    testDocs = [
        "The Neatest Little Guide to Stock Market Investing",
        "Investing For Dummies, 4th Edition",
        "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
        "The Little Book of Value Investing",
        "Value Investing: From Graham to Buffett and Beyond",
        "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
        "Investing in Real Estate, 5th Edition",
        "Stock Investing For Dummies",
        "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
    ]
    stopwords = ['and','edition','for','in','little','of','the','to']
    ignorechars = ''',:'!'''

    #First we apply the standard sklearn algorithm to compare with.
    for i, element in enumerate(testDocs):
        #tokens.append(tokenizer.tokenize(element.lower()))
        #Write the lower-cased string back into the list (reassigning element alone would not change testDocs).
        testDocs[i] = element.lower()

    print(testDocs)

    #Vectorize the features.
    vectorizer = tfdv(max_df=0.5, min_df=2, max_features=8, stop_words='english', use_idf=True)#, ngram_range=(1,3))
    #Store the values in matrix X.
    X = vectorizer.fit_transform(testDocs)
    #Apply LSA.
    lsa = TruncatedSVD(n_components=3, n_iter=100)
    lsa.fit(X)

    #Get a list of the terms in the order they were decomposed.
    terms = vectorizer.get_feature_names()
    print("Terms decomposed from the document: " + str(terms))
    print()

    #Prints the matrix of concepts. Each number represents how important the term is
    #to the concept, and its position matches the position of the term in the terms list.
    print("Number of components in element 0 of matrix of components:")
    print(lsa.components_[0])
    print("Shape: " + str(lsa.components_.shape))
    print()
    for i, comp in enumerate(lsa.components_):
        #Pair each term with its weight in this component. zip builds tuples from the 2 sequences.
        termsInComp = zip(terms, comp)
        #Sort the terms by their weight in the component, highest first.
        sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)
        print("Concept %d:" % i)
        for term in sortedTerms:
            print(term[0], end="\t")
        print()
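
For comparison, gensim's LsiModel on the same documents gives me more varied terms per topic. This is only a rough sketch of that comparison (the plain-split tokenization, the reuse of the stopwords list above, and the variable names are placeholders, not my exact preprocessing):

    from gensim import corpora
    from gensim.models import LsiModel

    #Tokenize with a plain split, lower-case, and drop the stopwords defined above.
    texts = [[w for w in doc.lower().split() if w not in stopwords] for doc in testDocs]

    #Build the dictionary and the bag-of-words corpus.
    dictionary = corpora.Dictionary(texts)
    bowCorpus = [dictionary.doc2bow(text) for text in texts]

    #Apply LSI with the same number of topics and print the top words per topic.
    lsi = LsiModel(bowCorpus, id2word=dictionary, num_topics=3)
    for topicId, topic in lsi.print_topics(num_topics=3, num_words=5):
        print(topicId, topic)
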
  • One could argue that there is no problem to begin with. In the code, *sortedTerms* is the set of topics. The objective of topic modelling is to identify the topics from a corpus and it does exactly that. – SidharthMacherla Apr 16 '20 at 01:24
  • Thanks a lot for your answer! However, when I test the same approach with gensim, it gives me more variation in the words after applying LsiModel to the same bag of words. – Faust Alexander Apr 16 '20 at 01:44

0 Answers