Computing Trending Topics

Question

Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing them in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1 to 3 words in length.

Is it possible to write a script to do something like this with PHP and MySQL?

I've found answers on how to compute which terms are "hot" once you have counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms that are 1 to 3 words in length?

Are you looking for a way to pool together topics that are sorted in a stored-group known as trends? — Anthony Forloney, Feb 11 '10 at 21:12
Basically a keyword phrase can be 1-3 words in length. So if "Michael Jackson" is a popular topic, it should recognize that "Michael Jackson" is a single keyword phrase. Not "Michael" and "Jackson" as separate popular keywords. Is that clear at all? — Brian, Feb 11 '10 at 21:18

score 2 · Answer 1 · answered Apr 25 '11 at 12:46

My recipe for trending topics:
1. Fetch the tweets.
2. Split each tweet on spaces into an array of n-grams (up to 3-grams if you want phrases up to 3 words long).
3. Filter URLs, @usernames, common words, and junk characters out of each array.
4. Count the frequency of every unique keyword or phrase.
5. Mute any junk words or phrases.

Yes, you can do it with PHP and MySQL ;)
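
The steps above can be sketched roughly as follows. This is only an illustration in Python (the same logic ports to PHP with `preg_replace`, `preg_match_all`, and an associative array), not the answerer's actual code. The function names, the tiny `STOP_WORDS` list, and the choice to drop only phrases that start or end with a common word are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to", "rt"}

def tokenize(tweet):
    """Lowercase a tweet, drop URLs and @usernames, and return its word tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    # Keeping only letters, digits, and apostrophes discards junk characters.
    return re.findall(r"[a-z0-9']+", text)

def ngrams(tokens, max_n=3):
    """Yield every run of 1..max_n consecutive tokens as a phrase."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumption: skip phrases that start or end with a common word,
            # but allow one in the middle (e.g. "king of pop").
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            yield " ".join(gram)

def count_phrases(tweets, max_n=3):
    """Count how often each 1..max_n word phrase occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), max_n))
    return counts
```

Calling `count_phrases(tweets).most_common(20)` then gives the most frequent phrases, which is where step 5 (muting junk by hand) would come in.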

score 1 · Answer 2 · answered Feb 11 '10 at 21:29

How about first decomposing your tweets into single-word tokens and calculating the number of occurrences of every word? Once you have those, you could decompose the tweets into all two-word tokens, calculate their occurrences, and finally do the same with all three-word tokens.

You might also want to add some kind of dictionary of words you don't want to count.
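
On the storage side, one possible layout is a single table of running totals keyed by phrase and time bucket, so that counts are incremented as tweets arrive instead of being recomputed from the raw tweets. The sketch below is an assumption, not something from the answer: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name `phrase_counts`, its columns, and the helper functions are all made up for illustration. In MySQL the insert-then-update pair could be collapsed into one `INSERT ... ON DUPLICATE KEY UPDATE` statement.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) counts table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS phrase_counts (
               phrase TEXT NOT NULL,
               bucket TEXT NOT NULL,          -- e.g. '2010-02-11 21' for hourly buckets
               cnt    INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (phrase, bucket)
           )"""
    )
    return conn

def add_counts(conn, bucket, counts):
    """Add a {phrase: count} mapping to the running totals for one time bucket."""
    items = list(counts.items())
    with conn:  # one transaction per batch keeps the writes cheap
        # Make sure a row exists for every phrase, then increment it.
        conn.executemany(
            "INSERT OR IGNORE INTO phrase_counts (phrase, bucket, cnt) VALUES (?, ?, 0)",
            [(phrase, bucket) for phrase, _ in items],
        )
        conn.executemany(
            "UPDATE phrase_counts SET cnt = cnt + ? WHERE phrase = ? AND bucket = ?",
            [(n, phrase, bucket) for phrase, n in items],
        )

def top_phrases(conn, bucket, limit=10):
    """Return the most frequent phrases in one bucket as (phrase, count) rows."""
    cur = conn.execute(
        "SELECT phrase, cnt FROM phrase_counts WHERE bucket = ? "
        "ORDER BY cnt DESC, phrase LIMIT ?",
        (bucket, limit),
    )
    return cur.fetchall()
```

Keeping one row per phrase per bucket also means the "hot" calculation later only has to compare a phrase's count in the current bucket against its counts in earlier ones.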

Dominik

Do you have any suggestions for doing this efficiently? This seems like a pretty good idea. – Brian Feb 11 '10 at 21:35
I second the request in the comment. There is a serious lack of info on this topic on the web currently. – ChuckKelly Sep 04 '13 at 01:27

score 1 · Answer 3 · answered Feb 11 '10 at 21:31

What you need is either:

document classification, or
automatic tagging

Probably the second one. Only once the tweets are tagged can you count how popular each tag is over time.

Artjom Kurapov

score 0 · Answer 4 · answered Feb 11 '10 at 21:34

Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in your data store (file, SQL table, whatever), run the regexes and count the matches.
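
A minimal sketch of that phrase-list lookup, written in Python for illustration (PHP's `preg_match_all` would do the same job). The function names are hypothetical, and matching case-insensitively on whole words with flexible whitespace is an assumption about what "spaces and all" should mean in practice.

```python
import re
from collections import Counter

def compile_phrases(phrases):
    """Turn each phrase into a case-insensitive, whole-word regex."""
    patterns = {}
    for phrase in phrases:
        # Escape each word and allow any run of whitespace between words.
        words = [re.escape(word) for word in phrase.split()]
        patterns[phrase] = re.compile(
            r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE
        )
    return patterns

def count_matches(rows, patterns):
    """Count occurrences of every listed phrase across all rows of text."""
    counts = Counter()
    for text in rows:
        for phrase, pattern in patterns.items():
            counts[phrase] += len(pattern.findall(text))
    return counts
```

Note that this runs every pattern against every row, so the cost grows with the size of the phrase list times the number of tweets.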

It depends on which way around you want to do it: count everything and subtract whatever is common, thereby finding what is truly trending, or look up a set list of phrases. In the first case, you'll find a lot that might not interest you and you'll need an extensive blocklist. In the second case, you'll need a huge whitelist.

To go beyond that, you need natural language processing tools to determine the meaning of what is said.
