
I am working on keyword extraction. The system takes a URL as input, and the output is supposed to be keywords describing the content of the URL. We are considering only the textual parts for now. I would like to know what methods I can employ to extract keywords from a URL and how they compare with each other. Suggestions and pointers to other resources are welcome.

dknight
  • What language are you using? Different ways in different languages... – kfox Feb 18 '11 at 19:44
  • I thought the techniques would not depend on the programming language chosen. If, however, they do, then I can use C, Python, or Lisp, and a friend can work in PHP and Java/.NET. – dknight Feb 19 '11 at 07:19
  • Are you looking at just a single URL, or multiple URLs from across a whole domain? – Joel Feb 28 '11 at 11:59
  • Well, the input will be a URL. If the URL is like http://intosimple.blogspot.com/2011/03/beauty-of-gentoo-installation.html then it is easier; but if it is http://intosimple.blog.com/ then the task has to be subdivided. – dknight Mar 08 '11 at 11:07
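
For the "easier" case in the last comment, the URL slug itself already carries keywords. A minimal sketch of that idea, assuming Python 3's urllib.parse and an illustrative stop-word list (neither appears in the original discussion):

```python
# Hypothetical helper: pull candidate keywords from a descriptive URL slug.
from urllib.parse import urlparse

STOP_WORDS = {"of", "the", "a", "an", "and"}  # illustrative, not exhaustive

def slug_keywords(url):
    # "/2011/03/beauty-of-gentoo-installation.html" -> "beauty-of-gentoo-installation"
    path = urlparse(url).path
    slug = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return [w for w in slug.split("-")
            if w and not w.isdigit() and w not in STOP_WORDS]

print(slug_keywords(
    "http://intosimple.blogspot.com/2011/03/beauty-of-gentoo-installation.html"))
# -> ['beauty', 'gentoo', 'installation']
```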

1 Answer


I think you can use this method:

Read the site with urllib (http://docs.python.org/library/urllib2.html?highlight=urllib2#module-urllib2), then remove the tags to produce the plain text of the site.

Then check which words are used most often and build a top-ten list (or keep the counts).
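
As a rough illustration of the fetch-and-strip step: the answer links to Python 2's urllib2, but a minimal sketch in current Python can use urllib.request together with the standard-library HTMLParser to drop tags and script/style contents. The decoding choices here are assumptions, not part of the original answer.

```python
# Hypothetical sketch: fetch a page and reduce it to plain text.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def plain_text(url):
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```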

Mohammad Efazati
  • I am using pycurl to fetch and process the web page. Well, the essence of the question is how to process the plain text obtained from the site. – dknight Mar 08 '11 at 11:09
  • 1
    create dictiony for each word and count this word {"word":count} then show top 10 word. ps: delete html and js tag – Mohammad Efazati Mar 08 '11 at 11:23
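
A minimal sketch of the counting step described in that comment, assuming the plain text has already been extracted (the text argument is a placeholder) and using collections.Counter as a stand-in for the hand-rolled {"word": count} dictionary:

```python
# Hypothetical sketch: count word frequencies and report the top ten.
import re
from collections import Counter

def top_words(text, n=10):
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenization
    counts = Counter(words)                       # behaves like {"word": count}
    return counts.most_common(n)

# Usage with the earlier plain_text() helper (an assumption, not the
# answerer's code):
#   print(top_words(plain_text("http://intosimple.blogspot.com/")))
```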