
I'm trying to categorize customer feedback. I ran LDA in Python and got the following output for 10 topics:

(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"') 

Is there a way to automatically label them? I do have a CSV file in which the feedback is manually labeled, but I do not want to supply these labels myself. I want the model to create the labels. Is that possible?

Arman
  • See a similar question [here](http://stackoverflow.com/questions/33921808/topic-modelling-assign-human-readable-labels-to-topic/33943104#33943104). – Ettore Rizza May 19 '17 at 08:38
  • Possible duplicate of [Topic Modelling - Assign human readable labels to topic](http://stackoverflow.com/questions/33921808/topic-modelling-assign-human-readable-labels-to-topic) – Ettore Rizza May 19 '17 at 08:39

1 Answer


The comments here link to another SO answer, which in turn links to a paper. Let's say you wanted to do the minimum to try to make this work. Here is an MVP-style solution that has worked for me: search Google for the topic terms, then look for the most frequent keywords in the response.

Here is some working, though hacky, code:

pip install requests lxml cssselect

then

import re
from collections import Counter
from urllib.parse import urlencode

from lxml.html import fromstring
from requests import get

def get_srp_text(search_term):
    # Fetch the Google search results page (SRP) for the topic terms
    raw = get(f"https://www.google.com/search?{urlencode({'q': search_term})}").text
    page = fromstring(raw)

    # Concatenate the text of the result links into one blob
    blob = ""
    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob


def blob_cleaner(blob):
    # Replace punctuation with spaces, then keep only alphanumeric
    # characters and whitespace
    clean_blob = re.sub(r'[\\/():,_\-]', ' ', blob)
    return ''.join(e for e in clean_blob if e.isalnum() or e.isspace())


def get_name_from_srp_blob(clean_blob):
    # Keep tokens longer than two characters and count their frequency
    blob_tokens = [t for t in clean_blob.split() if len(t) > 2]
    most_common = Counter(blob_tokens).most_common(10)

    # Name the topic after the two most frequent tokens
    name = f"{most_common[0][0]}-{most_common[1][0]}"
    return name

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))

Then you can just get the topic words from your model, e.g.

topic_terms = "delivery area mile option partner traffic hub thanks city way"

name = pipeline(topic_terms)
print(name)

>>> City-Transportation

and

topic_terms = "package address time customer apartment delivery number item support door"

name = pipeline(topic_terms)
print(name)

>>> Parcel-Package
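If your topics come back as formatted strings like the ones in the question, a small stdlib helper can pull out just the words before feeding them to the pipeline (a sketch, assuming the gensim `show_topics`-style string format shown above):

```python
import re

def topic_string_to_terms(topic_string):
    # Extract every quoted word from a gensim-style topic string,
    # e.g. '0.559*"delivery" + 0.124*"area"' -> 'delivery area'
    return " ".join(re.findall(r'"([^"]+)"', topic_string))

topic_terms = topic_string_to_terms('0.559*"delivery" + 0.124*"area" + 0.018*"mile"')
# topic_terms == "delivery area mile"
```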

You could improve this a lot. For example, you could use POS tags to find only the most common nouns, then use those for the name. Or find the most common adjective and noun, and make the name "Adjective Noun". Even better, you could fetch the text from the linked sites, then run YAKE to extract keywords.

Regardless, this demonstrates a simple way to automatically name clusters without directly using machine learning (though Google is most certainly using it to generate the search results, so you are benefiting from it indirectly).

Sam H.
  • Note: Google's SRP has changed and the code no longer works – Sam H. Apr 22 '22 at 00:56
  • hello, I am stuck with similar problem and tried this solution. But I get an error: pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x))) IndexError: list index out of range how can I fix this please? – user18334254 Aug 06 '22 at 14:01
  • would you be able to help me please @Sam H.? – user18334254 Aug 19 '22 at 14:57
  • @user18334254 - this is happening bc `get_srp_text` has to get text based on CSS on the Google SRP and that page's CSS changed. You will have to edit the block that starts `for result in page.cssselect("a"):` to work with the current CSS – Sam H. Aug 23 '22 at 21:05
  • thanks @Sam H. Could you guide me howI can understand about how this get_srp_text changed? I want to know how I should edit the block for result in page.cssselect("a"): please? – user18334254 Aug 26 '22 at 12:22
  • @user18334254 you would need to go to the Google search results page and inspect it and look at the XML, classes, etc. There are blog posts on it, some may have working code, but google changes the SRP often enough that code like this breaks frequently – Sam H. Aug 29 '22 at 22:11
  • Hi Sam, I used your code, but getting different names for topic_terms: insights-mediumcom and auspostcomau-help for the 1st and 2nd topic_terms. – Sam S. Sep 28 '22 at 01:32