I am working with a non-English language (Urdu) and have scraped data from several sources. I have already done the preprocessing: punctuation removal, stop-word removal, and tokenization. Now I want to extract domain-specific lexicons. Say I have data on sports, entertainment, etc.: I want to extract the words that belong to each field (for example, cricket terms for sports) and group them into closely related topics. I tried LDA for this, but the clusters I get are not correct. In addition, a word that belongs to one topic also shows up in other topics.
How can I improve my results?
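from gensim import corpora, models

# UrduCorpusReader, wordlists and remove_urdu_stopwords are defined elsewhere (not shown)
# URDU STOP WORDS REMOVAL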
doc_clean = []
stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
stopwords = stopwords_corpus.words()
# print(stopwords)
for infile in wordlists.fileids():
    words = wordlists.words(infile)
    # print(words)
    finalized_words = remove_urdu_stopwords(stopwords, words)
    # list.append returns None, so its result is not assigned
    doc_clean.append(finalized_words)
    print("\n==== WITHOUT STOPWORDS ===========\n")
    print(finalized_words)
# making dictionary and corpus
dictionary = corpora.Dictionary(doc_clean)
# convert tokenized documents into a document-term matrix
matrx = [dictionary.doc2bow(text) for text in doc_clean]
# generate LDA model
lda = models.ldamodel.LdaModel(corpus=matrx, id2word=dictionary, num_topics=5, passes=10)
for top in lda.print_topics():
    print("\n===topics from files===\n")
    print(top)
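
To make the overlap problem concrete, here is a minimal sketch in plain Python, not part of the pipeline above. It assumes that words occurring in most documents are one reason the same word ends up in several topics, which may or may not be the case for this data. The helper names `prune_by_document_frequency` and `shared_topic_words`, the thresholds, and the toy tokens are all hypothetical. gensim's `Dictionary.filter_extremes` takes similar `no_below` and `no_above` parameters and could be tried on `dictionary` before building the bag-of-words corpus.

```python
from collections import Counter


def prune_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Drop tokens found in fewer than `no_below` documents or in more
    than `no_above` (a fraction) of all documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in tokens if tok in keep] for tokens in docs]


def shared_topic_words(topics):
    """Return the words that appear in the top-word list of more than one topic."""
    counts = Counter(word for words in topics for word in set(words))
    return sorted(word for word, n in counts.items() if n > 1)


# Hypothetical toy data standing in for doc_clean (transliterated tokens).
toy_docs = [
    ["cricket", "match", "team", "aaj"],
    ["film", "actor", "team", "aaj"],
    ["cricket", "bowler", "match", "aaj"],
    ["film", "song", "actor", "aaj"],
]
pruned = prune_by_document_frequency(toy_docs, no_below=2, no_above=0.5)

# Hypothetical top-word lists for two topics, to check which words overlap.
toy_topics = [["cricket", "match", "team"], ["film", "actor", "team"]]
shared = shared_topic_words(toy_topics)

print(pruned)
print(shared)
```

The first helper removes both very rare tokens and tokens present in more than half of the documents; the second lists the words that show up in more than one topic, which gives a rough way to compare runs with different `num_topics` or pruning thresholds.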