How can I determine whether a game is 'arcade', 'sports', or 'strategy' by parsing its web page? I am talking about small Flash games that are hosted on web pages.

For instance, take a look at these web pages: http://www.miniclip.com/games/ski-safari/en/ or http://www.2dplay.com/the-last-dino/the-last-dino-play.htm

Are there services that exist to do some sort of 'categorization'? Are there existing NLP algorithms that can help?

mynk
  • Question too broad? I thought this was a very rarely occurring problem. Is Stack Overflow meant only for suggestions on common problems? – mynk Jan 11 '14 at 10:10

1 Answer

You can extract the relevant text from a web page and use a bag-of-words approach for classification. In the simplest case, you just define the game categories and a list of keywords for each of them. The more of a category's keywords appear on the page, the more likely it is that the game belongs to that category.
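
As a rough illustration of the keyword-counting idea, here is a minimal sketch in Python. The keyword lists, the `CATEGORY_KEYWORDS` name, and the helper functions are all made up for the example; real lists would have to be tuned against actual game pages.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "arcade": {"jump", "dodge", "runner", "score", "obstacles", "endless"},
    "sports": {"ski", "football", "race", "goal", "team", "tournament"},
    "strategy": {"build", "defend", "army", "tower", "resources", "plan"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_categories(text):
    """For each category, count how many tokens in the text are its keywords."""
    counts = Counter(tokenize(text))
    return {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }


def classify(text):
    """Return the best-scoring category, or None if no keyword matched."""
    scores = score_categories(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    page_text = "Build a tower and defend your base from the enemy army."
    print(score_categories(page_text))
    print(classify(page_text))
```

Note that this matches exact word forms only ("tower" but not "towers"), so in practice you would probably want stemming or longer keyword lists.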

For a more sophisticated approach, take a look at classification algorithms (e.g. Naive Bayes) and text-specific features (e.g. tf-idf).
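
To give an idea of what Naive Bayes does with bag-of-words features, here is a small pure-Python sketch of a multinomial Naive Bayes classifier with add-one smoothing. It uses raw word counts rather than tf-idf weights, the class name and the tiny training set are invented for the example, and for real work you would more likely reach for an existing machine learning library and a much larger set of labeled pages.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalized log-probability of each label for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen during training carry no information; skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training examples; real ones would come from labeled game pages.
TRAINING_DATA = [
    ("ski down the mountain and race other players", "sports"),
    ("score goals with your football team", "sports"),
    ("build towers to defend your castle", "strategy"),
    ("manage resources and command your army", "strategy"),
    ("jump over obstacles and collect coins", "arcade"),
    ("dodge enemies and beat the high score", "arcade"),
]

texts, labels = zip(*TRAINING_DATA)
classifier = NaiveBayesTextClassifier().fit(texts, labels)
print(classifier.predict("command an army and build a castle"))
```

Unlike the plain keyword lists, the word weights here are learned from the labeled examples, so the quality of the result depends mostly on how many labeled pages you can collect.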

Also note that extracting the relevant text from a page is important here. If, for example, a page contains a couple of words about this specific game plus a list of related news items (describing other games), then snippets from that news may lower your accuracy a lot.
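
One possible way to cut down on that noise is to keep only the parts of the page that are likely to describe the game itself. The sketch below uses Python's standard `html.parser` module and assumes (this is only a guess that you would need to check per site) that the title, the meta description, and the `h1` heading are about the game, while everything else is ignored. The `RelevantTextExtractor` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RelevantTextExtractor(HTMLParser):
    """Collect text from <title>, <h1>, and the meta description only."""

    # Assumption: these tags describe the game itself on most pages.
    KEEP_TAGS = {"title", "h1"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._open_tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            content = attrs.get("content")
            if content:
                self.parts.append(content.strip())
        elif tag in self.KEEP_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        # Keep text only while we are inside one of the tags of interest.
        if self._open_tags and data.strip():
            self.parts.append(data.strip())


def extract_relevant_text(html):
    parser = RelevantTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: the news block talks about a different game.
SAMPLE_PAGE = """
<html><head><title>Tower Keeper</title>
<meta name="description" content="Build towers and defend your castle.">
</head><body>
<h1>Tower Keeper</h1>
<div class="news"><p>New football tournament game released!</p></div>
</body></html>
"""

print(extract_relevant_text(SAMPLE_PAGE))
```

The extracted text can then be passed to whichever classifier you use, so that the unrelated news snippet never reaches it.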

ffriend