2

I'm not sure how to phrase this question properly, but the hypothetical scenario below shows what I intend to achieve:

A user's email to me has just a SUBJECT and a BODY. The subject is the topic of the email, and the body is a description of that topic in a single paragraph of at most 1000 words. I would like to analyse this paragraph (the BODY) using some programming language (Python, maybe) and come up with a list of the most important words in the paragraph with respect to the topic given in the SUBJECT field.

For example, suppose the topic of the email is iPhone, and the body is something like "the iPhone redefines user-interface design with super resolution and graphics. it is fully touch enabled and allows users to swipe the screen".

The result I am looking for is a list of the key terms from the paragraph that relate to iPhone, for example: (user-interface, design, resolution, graphics, touch, swipe, screen).

So basically I want to pick the most relevant words from the paragraph. I am not sure what I can use, or how to use it, to achieve this result. Searching on Google, I read a little about Natural Language Processing, Python, classification, etc. I just need a general approach: which technology/language to use, which areas to read up on, and so on.

Thanks!

EDIT:

I have been reading up in the meantime. To be precise, I am looking at HOW TO do this, using WHAT TOOL:

Generate related tags from a body of text using NLP which are based on synonyms, morphological similarity, spelling errors and contextual analysis.

kallakafar
  • First you have to specify your problem more precisely. For example: are you looking for keywords that are nouns in the same sentence as one of the words of the subject? That would make your investigation much easier. – devsnd Oct 31 '12 at 16:34
  • In the meantime I was googling, and found Tagaroo and openCalais to be the closest match to what I am looking for. Tagaroo suggests keywords matching what I am writing in a blog post; if the same could be done on content in a web form, that would be awesome. Basically I am looking at automated tagging, folksonomy, automated metadata, etc. I just need to know how to approach this and with what tool. Thanks! – kallakafar Oct 31 '12 at 16:46

4 Answers

3

A naïve approach, based on information theory:

Given a corpus of text (in your example, roughly more than 1,000 e-mails if possible), compute the entropy of every distinct word in the corpus.

Sort the results, keep only the XX most relevant words, and you have your tagging scheme.

I once wrote a statistical translator in Python using the cross-entropy of words from the same text in two different languages, and it worked fairly well.
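A minimal sketch of this idea, under one possible reading of "entropy of a word": the entropy of the word's distribution over the documents of the corpus. Under that assumption, words spread evenly over all e-mails (such as "the") get high entropy, and words concentrated in a few e-mails get low entropy, so the lowest-entropy words of an e-mail are kept as tags. The function names and the toy corpus are made up for illustration, and only the standard library is used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def word_entropies(corpus):
    """Entropy (in bits) of each word's distribution over the documents.

    This is an assumed interpretation of "entropy of a word", not the only one.
    """
    per_doc = defaultdict(Counter)  # word -> {document index: count}
    for index, doc in enumerate(corpus):
        for word in tokenize(doc):
            per_doc[word][index] += 1

    entropies = {}
    for word, counts in per_doc.items():
        total = sum(counts.values())
        entropies[word] = sum(
            -(c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropies


def top_tags(email, entropies, n=5):
    """Return the n words of the e-mail with the lowest entropy (ties: alphabetical)."""
    words = set(tokenize(email))
    return sorted(words, key=lambda w: (entropies.get(w, 0.0), w))[:n]


# Toy corpus; a real one would contain many more e-mails.
corpus = [
    "the iphone lets users swipe the screen",
    "the meeting is moved to the afternoon",
    "the invoice for the order is attached",
]
entropies = word_entropies(corpus)
tags = top_tags(corpus[0], entropies, n=5)
print(tags)
```

With such a small corpus every word that occurs in a single e-mail ties at zero entropy, so in practice a minimum-frequency filter would probably be needed to keep rare typos out of the tags.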

lucasg
1

Might be overkill, but this kind of task could probably be solved with the Python library Natural Language Toolkit (NLTK) - http://nltk.org/

ccpizza
1

I am not an expert, but it seems you really need to define notions such as "key term" and "relevance", and then put a ranking algorithm on top of them. This sounds like NLP, and as far as I know there is a Python package called NLTK that might be useful in this field. Hope it helps!
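To make the "define relevance, then rank" split concrete, here is a rough standard-library sketch. The relevance definition used (raw frequency of a word in the body, after dropping a tiny hand-written stopword list and the subject's own words) is only a placeholder assumption; the point is that `rank_terms` accepts any scoring function, so a better definition can be swapped in later. All names here are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "is", "it", "with", "to", "of", "in", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def frequency_relevance(word, subject_words, body_counts):
    """Placeholder relevance: how often the word occurs in the body."""
    return body_counts[word]


def rank_terms(subject, body, relevance=frequency_relevance, n=5):
    """Rank the body's content words with a pluggable relevance function."""
    subject_words = set(tokenize(subject))
    body_counts = Counter(w for w in tokenize(body) if w not in STOPWORDS)
    candidates = [w for w in body_counts if w not in subject_words]
    # Highest relevance first; alphabetical order breaks ties deterministically.
    candidates.sort(key=lambda w: (-relevance(w, subject_words, body_counts), w))
    return candidates[:n]


ranked = rank_terms(
    "iPhone",
    "the iphone screen is touch enabled and users swipe the screen",
    n=3,
)
print(ranked)
```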

Vandalay
1

As others have said, NLTK is probably the go-to tool for doing NLP in Python.

As for technique, you're looking for something like a similarity metric between pairs of words. For every word in the text, compute this metric against each of the content-bearing words in the title, and keep the top-N words. Have a look at this paper for a survey of approaches, and see what NLTK gives you in terms of functionality. There is a great deal of research on this, though, and you'll probably be happy with something fairly simple (depending on exactly what your application is). Pointwise mutual information is usually a good starting point.
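As a rough illustration of the pointwise mutual information (PMI) suggestion, the sketch below estimates PMI from document-level co-occurrence in a background corpus, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), scores each word of the body by its best PMI with any word of the subject, and keeps the top-N. The background corpus, the function names, and the choice of "best PMI over subject words" as the score are all assumptions made for the example; it uses only the standard library rather than NLTK.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def make_pmi(corpus):
    """Build a PMI function from document-level co-occurrence counts."""
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)
    doc_freq = Counter(word for words in doc_sets for word in words)

    def pmi(x, y):
        joint = sum(1 for words in doc_sets if x in words and y in words)
        if joint == 0:
            return float("-inf")  # the two words never co-occur in the corpus
        # p(x, y) / (p(x) * p(y)) with probabilities as document fractions.
        return math.log2(joint * n_docs / (doc_freq[x] * doc_freq[y]))

    return pmi


def top_related(subject, body, pmi, n=5):
    """Top-n body words ranked by their best PMI with any subject word."""
    subject_words = set(tokenize(subject))
    if not subject_words:
        return []
    scores = {}
    for word in set(tokenize(body)) - subject_words:
        score = max(pmi(word, s) for s in subject_words)
        if score != float("-inf"):
            scores[word] = score
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Toy background corpus; real estimates need far more text.
corpus = [
    "iphone touch screen swipe",
    "iphone screen resolution",
    "the meeting is on the calendar",
    "the iphone is on the desk",
]
pmi = make_pmi(corpus)
related = top_related("iPhone", "the iphone screen is on", pmi, n=1)
print(related)
```

PMI estimated this way tends to favour rare words, so on a real corpus a minimum document frequency or some smoothing would likely be worth adding.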

Ben Allison