Cast topic modeling outcome to dataframe

Question

I have used BERTopic with KeyBERT to extract some topics from some docs:

from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model="paraphrase-MiniLM-L3-v2", min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)

Now I can access the topic name

freq = topic_model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head(30)

   Topic    Count   Name
0   -1       1     -1_default_greenbone_gmp_manager
1    0      14      0_http_tls_ssl tls_ssl
2    1      8       1_jboss_console_web_application

and inspect the topics

[('http', 0.0855701486234524),          
 ('tls', 0.061977919455444744),
 ('ssl tls', 0.061977919455444744),
 ('ssl', 0.061977919455444744),
 ('tcp', 0.04551718585531556),
 ('number', 0.04551718585531556)]

[('jboss', 0.14014705432060262),
 ('console', 0.09285308122803233),
 ('web', 0.07323749337563096),
 ('application', 0.0622930523123512),
 ('management', 0.0622930523123512),
 ('apache', 0.05032395169459188)]

What I want is a final dataframe with the topic name in one column and the elements of the topic in another.

Expected outcome:

  class                         entities
0 http_tls_ssl tls_ssl           HTTP...etc
1 jboss_console_web_application  JBoss, console, etc

and a second dataframe with one column per topic name, holding that topic's elements:

  http_tls_ssl tls_ssl           jboss_console_web_application
0 http                           JBoss
1 tls                            console
2 etc                            etc

I could not find out how to do this. Is there a way?

Laurent · Accepted Answer · 2022-12-20T17:42:03.840

Here is one way to do it:

Setup

import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic()
# To keep the example reproducible in a reasonable time, limit to 3,000 docs
topics, probs = topic_model.fit_transform(docs[:3_000])

df = topic_model.get_topic_info()
print(df)
# Output
   Topic  Count                    Name
0     -1     23         -1_the_of_in_to
1      0   2635         0_the_to_of_and
2      1    114          1_the_he_to_in
3      2    103         2_the_to_in_and
4      3     59           3_ditto_was__
5      4     34  4_pool_andy_table_tell
6      5     32       5_the_to_game_and

First dataframe

Using Pandas string methods:

df = (
    df.rename(columns={"Name": "class"})
    .drop(columns=["Topic", "Count"])
    .reset_index(drop=True)
)

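# Keep only the word from each (word, score) pair; empty strings become <NA>
df["entities"] = [
    [item[0] if item[0] else pd.NA for item in words]
    for words in topic_model.get_topics().values()
]

# Remove the -1 (outlier) topic; .copy() avoids a possible SettingWithCopyWarning below
df = df.loc[~df["class"].str.startswith("-1"), :].copy()

# Remove the numeric prefix ('1_', '2_', ...); the raw string keeps \d intact
df["class"] = df["class"].replace(r"^-?\d+_", "", regex=True)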

print(df)
# Output
                  class                                                      entities
1         the_to_of_and                [the, to, of, and, is, in, that, it, for, you]
2          the_he_to_in               [the, he, to, in, and, that, is, of, his, year]
3         the_to_in_and             [the, to, in, and, of, he, team, that, was, game]
4           ditto_was__  [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
5  pool_andy_table_tell  [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
6       the_to_game_and           [the, to, game, and, games, espn, on, in, is, have]
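
The prefix-stripping step above can be checked on its own, without fitting a model. The sketch below applies the same pattern with Python's re module to topic names copied from the outputs in this thread. Note that strip_topic_prefix is a hypothetical helper name used for illustration, not part of BERTopic or pandas.

```python
import re

# Matches a leading topic id such as "0_", "12_" or "-1_"
PREFIX = re.compile(r"^-?\d+_")


def strip_topic_prefix(name: str) -> str:
    """Remove the numeric topic id at the start of a topic name (hypothetical helper)."""
    return PREFIX.sub("", name)


# Topic names copied from the outputs shown in this thread
names = [
    "-1_default_greenbone_gmp_manager",
    "0_http_tls_ssl tls_ssl",
    "3_ditto_was__",
]
cleaned = [strip_topic_prefix(n) for n in names]
print(cleaned)
```

Writing the pattern as a raw string avoids the invalid-escape warning that newer Python versions emit for "\d" in a normal string literal.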

Second dataframe

Using Pandas transpose:

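other_df = df.T.reset_index(drop=True)  # row 0: class names, row 1: entity lists
new_col_labels = other_df.iloc[0]  # save first row (the class names)
other_df = other_df[1:]  # remove first row, keeping only the entity lists
other_df.columns = new_col_labels
# Expand each entity list into its own column
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})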

print(other_df)
# Output
  the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
0           the          the           the       ditto                 pool             the
1            to           he            to         was                 andy              to
2            of           to            in        <NA>                table            game
3           and           in           and        <NA>                 tell             and
4            is          and            of        <NA>                   us           games
5            in         that            he        <NA>                 well            espn
6          that           is          team        <NA>                 your              on
7            it           of          that        <NA>                about              in
8           for          his           was        <NA>                 <NA>              is
9           you         year          game        <NA>                 <NA>            have
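
Both shapes can also be built directly from a dictionary of the form {topic_id: [(word, score), ...]}, which is the shape of the topic listings in the question. The sketch below is only an illustration under that assumption: it hard-codes the question's two topics (scores rounded) in place of a fitted model, so topic_words and names are stand-in variables, not BERTopic objects. Wrapping each word list in a pd.Series lets pandas pad shorter topics with NaN if the lists differ in length.

```python
import pandas as pd

# Stand-in for a fitted model's topics: {topic_id: [(word, score), ...]}
# (values copied from the question, scores rounded)
topic_words = {
    0: [("http", 0.086), ("tls", 0.062), ("ssl tls", 0.062),
        ("ssl", 0.062), ("tcp", 0.046), ("number", 0.046)],
    1: [("jboss", 0.140), ("console", 0.093), ("web", 0.073),
        ("application", 0.062), ("management", 0.062), ("apache", 0.050)],
}
# Stand-in topic names with the numeric prefix already removed
names = {0: "http_tls_ssl tls_ssl", 1: "jboss_console_web_application"}

# First dataframe: one row per topic, words joined into a single string
first = pd.DataFrame(
    {
        "class": [names[t] for t in topic_words],
        "entities": [
            ", ".join(word for word, _ in words) for words in topic_words.values()
        ],
    }
)

# Second dataframe: one column per topic, one word per row
second = pd.DataFrame(
    {
        names[t]: pd.Series([word for word, _ in words])
        for t, words in topic_words.items()
    }
)

print(first)
print(second)
```

Joining the words into one string matches the asker's expected "entities" column; dropping the join keeps each cell as a list, as in the answer above.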

The problem with this solution is that the entities column contains only the terms used in the topic name and ignores the other words in the topic. - xavi, Dec 19 '22 at 19:07
Now I can see that you start from the topic name and cast its words to a new column. - xavi, Dec 19 '22 at 19:10
Hi @Laurent. Your solution is smart. I want to have all the terms that appear in a topic, not only the ones from the topic name. Thanks. - xavi, Dec 20 '22 at 07:52
