Bilingual Latent Dirichlet Allocation into [a Modified] K-Means Clustering Algorithm

Question

I have a thesis paper that focuses on using Bilingual LDA and a modified version (modified for runtime) of K-Means for Sentiment Analysis (using Multinomial NB) on Filipino and English COVID-19 Tweets.

I ran Bi-LDA using https://github.com/1991wzc/python-LDA-and-BiLDA/blob/master/BiLDA.py on the preprocessed (tokenized and lemmatized) tweets. It produced text files containing the theta values, the phi values, the topics of the words, and a wordmap (screenshot of the files attached): ![Bi-LDA outputs](https://i.stack.imgur.com/7On0l.png)

However, I cannot work out how to apply K-Means to these Bi-LDA output files, since I do not know what to do next.

I will also attach the Google Colab of the .ipynb file so you can see what is needed to be put for K-Means: https://colab.research.google.com/drive/1FE4WkG-cEe1SPmFm49Z6ovA7oREg17VT?usp=sharing

Thank you so much, a little help will surely make a difference for me and my group.

Before running the Bi-LDA algorithm, I set its K to the same value as the K in my K-Means, six (6), so that K-Means only needs six (6) values once each topic is clustered.

I do not really know what to expect, since I don't know which values should be fed into a K-Means algorithm, but I expect that after K-Means I can move on to the statistical treatments.

score 0 · Answer 1 · answered Nov 02 '22 at 16:22

I'm not sure I can answer your question exactly, but I suppose you need to run the K-Means algorithm on your data.

Based on your source code, I've noticed the following:

- You are tokenizing your tweets with .split(). You could save a lot of lines and list comprehensions by using a library tokenizer such as nltk's. Since you are tokenizing tweets, you can use TweetTokenizer (from nltk.tokenize import TweetTokenizer). Lemmatization can be handled in a similar way.
- Your output files are unstructured; you need to structure them so that they can be easily understood. You can store them as CSV. Alternatively:
  - Store the processed data in a Pandas DataFrame. That gives you a table-like structure that you can modify the way you want.
- Instead of coding your own K-Means algorithm, you can import KMeans from sklearn.cluster.
- With that in place, you can obtain statistical treatments such as scores and cluster centers. You can also compute further statistics with numpy.
- (Additional) If you want to visualize your data, you can use packages such as matplotlib or seaborn.
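
As a rough illustration of the DataFrame and KMeans points above, here is a minimal sketch. It assumes that the theta output is a documents-by-topics matrix (one row per tweet, one column per topic, each row summing to 1), which I have not verified against your files. To keep the example runnable on its own, it uses randomly generated stand-in data instead of your real theta file; the helper name cluster_documents and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_documents(theta, n_clusters=6, seed=0):
    """Cluster documents by their topic proportions (the rows of theta)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(theta)
    return model, labels


# Stand-in for the real Bi-LDA theta output (ASSUMPTION: documents x topics).
# 200 fake documents, 6 topics, each row is a probability distribution.
rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=[0.3] * 6, size=200)

model, labels = cluster_documents(theta, n_clusters=6)

# Put everything in a table so each tweet's topic mix and cluster sit side by side.
df = pd.DataFrame(theta, columns=[f"topic_{k}" for k in range(theta.shape[1])])
df["cluster"] = labels

print(df["cluster"].value_counts().sort_index())  # number of documents per cluster
print(model.cluster_centers_.shape)               # one center per cluster
print(model.inertia_)                             # within-cluster sum of squares
```

If your theta file turns out to be plain whitespace-separated numbers, loading it with numpy (for example np.loadtxt) in place of the random data should be enough, but check the file layout first. The cluster column can then be used as input for whatever statistical treatment your thesis requires.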

I hope this answer would be helpful for your thesis. I just saw your Facebook post from a data science group.
