
I want to use GSDMM to assign topics to some tweets in my data set. The only examples I found (1 and 2) are not detailed enough. Do you know of a source (or care enough to make a small example) that shows how GSDMM can be implemented in Python?

Pie-ton

3 Answers


I finally compiled my code for GSDMM and am posting it here from scratch for others to use. I have tried to comment on the important parts:

# Imports
import random

import numpy as np
import spacy
from gensim.models.phrases import Phraser, Phrases
from gensim.utils import simple_preprocess
from gsdmm import MovieGroupProcess

# spaCy model used later for lemmatization
# (en_core_web_sm is an assumption; any English model works)
nlp = spacy.load('en_core_web_sm')


# data
data = ...

# stop words
stop_words = ...

# turning sentences into words
data_words = []
for doc in data:
    doc = doc.split()
    data_words.append(doc)

# create vocabulary
vocabulary = ...

# Removing stop Words
stop_words.extend(['from', 'rt'])

def remove_stopwords(texts):
    return [
        [
            word
            for word in simple_preprocess(str(doc))
            if word not in stop_words
        ]
        for doc in texts
    ]

data_words_nostops = remove_stopwords(data_words)

# building bi-grams
bigram = Phrases(data_words_nostops, min_count=5, threshold=100)
bigram_mod = Phraser(bigram)
print('done!')

# Form Bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]

# lemmatization
pos_to_use = ['NOUN', 'ADJ', 'VERB', 'ADV']
data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent)) 
    data_lemmatized.append(
        [token.lemma_ for token in doc if token.pos_ in pos_to_use]
    )
      
docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)

# Train a new model 
random.seed(1000)
# Init of the Gibbs Sampling Dirichlet Mixture Model algorithm
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)

n_terms = len(vocab)
n_docs = len(docs)

# Fit the model on the data given the chosen seeds
y = mgp.fit(docs, n_terms)

def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(
            cluster_word_distribution[cluster].items(),
            key=lambda k: k[1],
            reverse=True,
        )[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print('-' * 20)

doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)

# Topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*'*20)


# Show the top 10 words in term frequency for each cluster 
top_words(mgp.cluster_word_distribution, top_index, 10)
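
To actually attach a topic to each tweet, you can zip the labels returned by `mgp.fit` back onto the original data. A minimal sketch (assuming `data` holds the raw tweets in the same order as `docs`; in the version of the gsdmm package I have used, `choose_best_label(doc)` returns the most probable cluster together with its probability):

# y (returned by mgp.fit above) contains one cluster label per document, in order
for tweet, topic in zip(data, y):
    print(topic, tweet)

# For a new (already preprocessed and tokenized) document, the fitted model can
# be queried directly; choose_best_label returns (cluster_index, probability)
new_doc = ['vaccine', 'trial', 'result']   # hypothetical token list
topic, prob = mgp.choose_best_label(new_doc)
print('Cluster %s with probability %.2f' % (topic, prob))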


Links

  1. gensim modules
  2. Python library gsdmm
desertnaut
Pie-ton

GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) is a short-text clustering model. It is essentially a modified LDA (Latent Dirichlet Allocation) which assumes that a document, such as a tweet or any other short text, contains only one topic.

(Images comparing the GSDMM and LDA models omitted.)

Address: github.com/da03/GSDMM

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse import find
import math

class GSDMM:
    def __init__(self, n_topics, n_iter, random_state=910820, alpha=0.1, beta=0.1):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        np.random.seed(random_state)
        self.alpha = alpha
        self.beta = beta
    def fit(self, X):
        alpha = self.alpha
        beta = self.beta

        D, V = X.shape
        K = self.n_topics

        N_d = np.asarray(X.sum(axis=1)).flatten()  # number of words per document (works for sparse or dense X)
        words_d = {}
        for d in range(D):
            words_d[d] = find(X[d,:])[1]

        # initialization
        N_k = np.zeros(K)
        M_k = np.zeros(K)
        N_k_w = lil_matrix((K, V), dtype=np.int32)

        K_d = np.zeros(D, dtype=int)  # cluster assignment of each document (int so it can be used as an index)

        for d in range(D):
            k = np.random.choice(K, 1, p=[1.0/K]*K)[0]
            K_d[d] = k
            M_k[k] = M_k[k]+1
            N_k[k] = N_k[k] + N_d[d]
            for w in words_d[d]:
                N_k_w[k, w] = N_k_w[k,w]+X[d,w]

        for it in range(self.n_iter):
            print('iter', it)
            for d in range(D):
                k_old = K_d[d]
                M_k[k_old] -= 1
                N_k[k_old] -= N_d[d]
                for w in words_d[d]:
                    N_k_w[k_old, w] -= X[d,w]
                # sample k_new
                log_probs = [0]*K
                for k in range(K):
                    log_probs[k] += math.log(alpha+M_k[k])
                    for w in words_d[d]:
                        N_d_w = X[d,w]
                        for j in range(N_d_w):
                            log_probs[k] += math.log(N_k_w[k,w]+beta+j)
                    for i in range(N_d[d]):
                        log_probs[k] -= math.log(N_k[k]+beta*V+i)
                log_probs = np.array(log_probs) - max(log_probs)
                probs = np.exp(log_probs)
                probs = probs/np.sum(probs)
                k_new = np.random.choice(K, 1, p=probs)[0]
                K_d[d] = k_new
                M_k[k_new] += 1
                N_k[k_new] += N_d[d]
                for w in words_d[d]:
                    N_k_w[k_new, w] += X[d,w]
        self.topic_word_ = N_k_w.toarray()
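
A minimal usage sketch for the class above; the document-term matrix here is built with scikit-learn's CountVectorizer, which is my own assumption, and the example texts are made up:

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'the vaccine trial results look promising',
    'new vaccine trial announced today',
    'the football match ended in a draw',
    'great goal in the football game',
]

# rows = documents, columns = vocabulary terms, values = word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = GSDMM(n_topics=2, n_iter=10, alpha=0.1, beta=0.1)
model.fit(X)

# topic_word_ holds the accumulated word counts per topic
print(model.topic_word_.shape)   # (n_topics, vocabulary_size)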
Mahsa Hassankashi
  • Thanks, but I was looking more for practical examples like what you would normally see on Medium or Towards Data Science. There are so many for LDA, but very, very few for GSDMM. – Pie-ton May 30 '20 at 21:34
  • I sent you a proper link; if that is not enough, let me know. – Mahsa Hassankashi May 30 '20 at 21:39
  • Thanks for that. I need a tutorial on how GSDMM is applied to assign topics to short texts using Python. For example, what code must be written and how the GSDMM package should be used (how to adjust the alpha and beta values, etc.) to arrive at the final answer. – Pie-ton May 30 '20 at 21:45
  • I understood what you want. As I mentioned, **LDA** is the mother of **GSDMM**, so let's start with the many LDA samples at https://www.kaggle.com/search?q=Lda; you can see how alpha and beta are adjusted there, and so on. – Mahsa Hassankashi May 30 '20 at 21:58

As I understand it, you have the code at https://github.com/rwalk/gsdmm but you need to decide how to apply it.

How does it work?

You can download the paper "A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering"; it shows that the cluster search is equivalent to a game of table choosing. Imagine you have a group of students and want to group them onto tables by their movie interests. Every student (= item) switches in each round to a table (= cluster) that holds students with similar movie tastes and that is popular. Alpha controls how easily a table gets removed when it is empty (low alpha = fewer tables). A small beta means a table is chosen more because its students are similar to the new student than because the table is popular. For short-text clustering you take words instead of movies.
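
A minimal sketch of how that analogy maps onto the package's parameters (K is only an upper bound on the number of tables, most of which can stay empty; the docs here stand in for your preprocessed, tokenized tweets and are made up):

from gsdmm import MovieGroupProcess

# hypothetical preprocessed tweets (lists of tokens)
docs = [
    ['vaccine', 'trial', 'promising'],
    ['new', 'vaccine', 'trial'],
    ['football', 'match', 'draw'],
    ['great', 'goal', 'football'],
]
vocab_size = len({w for doc in docs for w in doc})

# K = number of tables (upper bound on clusters); alpha and beta as described above
mgp = MovieGroupProcess(K=8, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, vocab_size)     # one table (cluster) label per document
print(labels)
print(mgp.cluster_doc_count)           # how many documents ended up on each table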

Alpha, beta, number of iterations

A low alpha therefore results in fewer clusters, while a high alpha keeps more clusters alive, many of them holding only a few documents. A high beta results in clusters driven by popularity, while a low beta results in clusters driven by word similarity (which are not as strongly populated). Which parameters you need depends on the dataset. The number of clusters is mostly controlled by beta, but alpha also has (as described) an influence. The result seems to be stable after about 20 iterations, but 10 can also be enough.
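
To get a feeling for this on a concrete dataset, a small sweep over alpha and beta can count how many clusters stay non-empty. A sketch, assuming docs and vocab_size are the preprocessed tweets and vocabulary size from the sketch above:

import numpy as np
from gsdmm import MovieGroupProcess

def n_nonempty_clusters(docs, vocab_size, alpha, beta, K=20, n_iters=20):
    mgp = MovieGroupProcess(K=K, alpha=alpha, beta=beta, n_iters=n_iters)
    mgp.fit(docs, vocab_size)
    # clusters that received at least one document
    return int(np.count_nonzero(mgp.cluster_doc_count))

for alpha in (0.01, 0.1, 0.5):
    for beta in (0.01, 0.1, 0.5):
        print(alpha, beta, n_nonempty_clusters(docs, vocab_size, alpha, beta))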

Data preparation process

Before you train the algorithm you need to create a clean data set: convert every text to lower case, remove non-ASCII characters and stop words, and apply stemming or lemmatisation. You will also need to apply the same process when you run the model on a new sample.
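
A rough sketch of such a cleaning step (the stop-word list is hypothetical and deliberately short; stemming or lemmatisation would follow afterwards, for example with the spaCy step shown in the first answer):

import re

stop_words = {'the', 'a', 'an', 'and', 'or', 'rt'}   # hypothetical stop-word list

def preprocess(text):
    text = text.lower()                                # lower case
    text = text.encode('ascii', 'ignore').decode()     # drop non-ASCII characters
    tokens = re.findall(r'[a-z]+', text)               # keep alphabetic tokens only
    return [t for t in tokens if t not in stop_words]  # remove stop words

print(preprocess('RT @user: The vaccine trial results look great!!'))
# ['user', 'vaccine', 'trial', 'results', 'look', 'great']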

simsi