
I am working on keyword extraction. The system takes a URL as input, and the output is supposed to be keywords describing the content of the URL. We are considering only the textual parts for now. I would like to know what methods I can employ to extract keywords from a URL and how they compare with each other. Suggestions and pointers to other resources are welcome.

dknight
  • What language are you using? Different ways in different languages... – kfox Feb 18 '11 at 19:44
  • I thought the techniques would not depend on the programming language chosen. If, however, they do, then I can use C, Python, or Lisp, and a friend can work in PHP and Java/.NET. – dknight Feb 19 '11 at 07:19
  • Are you looking at just a single URL, or multiple URLs from across a whole domain? – Joel Feb 28 '11 at 11:59
  • Well, the input will be a URL. If the URL is like http://intosimple.blogspot.com/2011/03/beauty-of-gentoo-installation.html then it is easier; but if it is http://intosimple.blog.com/ then the task has to be subdivided. – dknight Mar 08 '11 at 11:07
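
For the "easier" case in the last comment, the URL slug itself already carries keywords. A minimal sketch of that idea, assuming Python 3's urllib.parse and an illustrative stop-word list (neither appears in the original discussion):

```python
# Hypothetical helper: pull candidate keywords from a descriptive URL slug.
from urllib.parse import urlparse

STOP_WORDS = {"of", "the", "a", "an", "and"}  # illustrative, not exhaustive

def slug_keywords(url):
    # "/2011/03/beauty-of-gentoo-installation.html" -> "beauty-of-gentoo-installation"
    path = urlparse(url).path
    slug = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return [w for w in slug.split("-")
            if w and not w.isdigit() and w not in STOP_WORDS]

print(slug_keywords(
    "http://intosimple.blogspot.com/2011/03/beauty-of-gentoo-installation.html"))
# -> ['beauty', 'gentoo', 'installation']
```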

1 Answer


I think you can use this method:

Read the site with urllib (http://docs.python.org/library/urllib2.html?highlight=urllib2#module-urllib2), then remove the tags to produce the plain text of the site.

Then check which words are used most often and build a top-ten list (or keep the counts).
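
As a rough illustration of the fetch-and-strip step: the answer links to Python 2's urllib2, but a minimal sketch in current Python can use urllib.request together with the standard-library HTMLParser to drop tags and script/style contents. The decoding choices here are assumptions, not part of the original answer.

```python
# Hypothetical sketch: fetch a page and reduce it to plain text.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def plain_text(url):
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```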

Mohammad Efazati
  • I am using pycurl to fetch and process the web page. Well, the essence of the question is how to process the plain text obtained from the site. – dknight Mar 08 '11 at 11:09
  • 1
    create dictiony for each word and count this word {"word":count} then show top 10 word. ps: delete html and js tag – Mohammad Efazati Mar 08 '11 at 11:23
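
A minimal sketch of the counting step described in that comment, assuming the plain text has already been extracted (the text argument is a placeholder) and using collections.Counter as a stand-in for the hand-rolled {"word": count} dictionary:

```python
# Hypothetical sketch: count word frequencies and report the top ten.
import re
from collections import Counter

def top_words(text, n=10):
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenization
    counts = Counter(words)                       # behaves like {"word": count}
    return counts.most_common(n)

# Usage with the earlier plain_text() helper (an assumption, not the
# answerer's code):
#   print(top_words(plain_text("http://intosimple.blogspot.com/")))
```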