5

Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing these tweets in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1-3 words in length.

Is it possible to write a script to do something like this in PHP and MySQL?

I've found answers on how to compute which terms are "hot" once you're able to get counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms in the database that are 1-3 words in length?

Joe Doyle
Brian
  • Are you looking for a way to pool together topics that are sorted in a stored-group known as trends? – Anthony Forloney Feb 11 '10 at 21:12
  • Basically a keyword phrase can be 1-3 words in length. So if "Michael Jackson" is a popular topic, it should recognize that "Michael Jackson" is a single keyword phrase. Not "Michael" and "Jackson" as separate popular keywords. Is that clear at all? – Brian Feb 11 '10 at 21:18

4 Answers

2

My recipe for trending topics:
1. Fetch the tweets.
2. Split each tweet on spaces into an array of n-grams (up to 3-grams if you want phrases up to 3 words long).
3. Filter URLs, @usernames, common words, and junk characters out of each array.
4. Count the frequency of every unique keyword or phrase.
5. Mute any remaining junk words or phrases.

Yes, you can do it in PHP & MySQL ;)
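
The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP (the same logic ports directly), and the names `STOP_WORDS`, `tokenize`, `ngrams`, and `count_phrases` are made up for the example. The stop-word list is a tiny placeholder, and dropping phrases that start or end with a common word is just one possible filtering rule.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "and", "rt"}

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
JUNK_CHARS = re.compile(r"[^\w\s#']")


def tokenize(tweet):
    """Lowercase a tweet and drop URLs, @usernames and junk characters."""
    text = URL_OR_MENTION.sub(" ", tweet.lower())
    text = JUNK_CHARS.sub(" ", text)
    return text.split()


def ngrams(tokens, max_n=3):
    """Yield every run of 1..max_n consecutive tokens as a phrase."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumed rule: skip phrases that start or end with a common word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            yield " ".join(gram)


def count_phrases(tweets, max_n=3):
    """Count how often each 1..max_n word phrase occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), max_n))
    return counts
```

Calling `count_phrases(tweets).most_common(20)` on a list of tweet strings would then give the most frequent phrases, which can be fed into whatever "hotness" formula you already have.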

judotens
1

How about first decomposing your tweets into single-word tokens and calculating the number of occurrences of every word? Once you have those, you could decompose the tweets into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.

You might also want to add some kind of dictionary of words you don't want to count.
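
One way to persist those per-size counts so they can be compared over time is a single table keyed by phrase and time bucket. The sketch below is only an assumption about how that could look: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table name `phrase_counts`, its columns, and the helper functions are all hypothetical. In MySQL the insert-then-update pair would more likely be a single `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) counts table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS phrase_counts (
               phrase TEXT NOT NULL,
               words  INTEGER NOT NULL,   -- phrase length: 1, 2 or 3
               bucket TEXT NOT NULL,      -- time window, e.g. '2010-02-11 21:00'
               hits   INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (phrase, bucket)
           )"""
    )
    return conn


def add_counts(conn, bucket, counts):
    """Add a {phrase: occurrences} mapping to the totals for one time bucket."""
    for phrase, hits in counts.items():
        words = len(phrase.split())
        # Make sure the row exists, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO phrase_counts (phrase, words, bucket, hits) "
            "VALUES (?, ?, ?, 0)",
            (phrase, words, bucket),
        )
        conn.execute(
            "UPDATE phrase_counts SET hits = hits + ? "
            "WHERE phrase = ? AND bucket = ?",
            (hits, phrase, bucket),
        )
    conn.commit()


def top_phrases(conn, bucket, words, limit=10):
    """Return the most frequent phrases of a given length in one bucket."""
    rows = conn.execute(
        "SELECT phrase, hits FROM phrase_counts "
        "WHERE bucket = ? AND words = ? "
        "ORDER BY hits DESC, phrase LIMIT ?",
        (bucket, words, limit),
    )
    return rows.fetchall()
```

Keeping the word count in its own column lets you rank one-, two- and three-word phrases separately, and the bucket column gives you the per-period counts that a trending calculation needs.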

Dominik
1

What you need is either

  1. document classification, or
  2. automatic tagging

Probably the second one. Only then can you count the popularity of the resulting tags over time.

Artjom Kurapov
0

Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in the database (file, SQL table, whatever), run the regexes and count the matches.

It depends on which way around you want to do it: count everything and subtract what is common, thereby finding what is truly trending, or look up a set list of phrases. In the first case, you'll find a lot that might not interest you and you'll need an extensive blocklist; in the second, you'll need a huge whitelist.
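
The set-phrase lookup could look something like this. It is a rough sketch in Python, not PHP; `WATCHED_PHRASES` is an invented example list, and the choice to match whole words case-insensitively and tolerate extra whitespace between words is an assumption.

```python
import re
from collections import Counter

# Hypothetical watch list; in practice this would come from a table or file.
WATCHED_PHRASES = ["michael jackson", "super bowl", "olympics"]


def compile_patterns(phrases):
    """Turn each phrase into a case-insensitive whole-word regex."""
    patterns = {}
    for phrase in phrases:
        # Allow any run of whitespace between the words of a phrase.
        body = r"\s+".join(re.escape(word) for word in phrase.split())
        patterns[phrase] = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    return patterns


def count_watched(tweets, patterns):
    """Count the matches of every watched phrase across all tweets."""
    counts = Counter()
    for tweet in tweets:
        for phrase, pattern in patterns.items():
            counts[phrase] += len(pattern.findall(tweet))
    return counts
```

With this approach the cost grows with the number of watched phrases times the number of rows, so it suits a short, curated list better than open-ended discovery.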

To go beyond that, you need natural language processing tools to determine the meaning of what is said.