I am working on keyword extraction. The system takes a URL as input, and the output should be keywords describing the content at that URL. We are considering only the textual parts for now. I would like to know what methods I can employ for extracting keywords from URLs and how they compare with each other. Suggestions and redirections are welcome.
-
What language are you using? Different ways in different languages... – kfox Feb 18 '11 at 19:44
-
I thought the techniques would not depend on the programming language chosen. If however, they do, then I can use C, python, lisp and a friend can work in php and java/.NET. – dknight Feb 19 '11 at 07:19
-
Are you looking at just a single URL, or multiple URLs from across a whole domain? – Joel Feb 28 '11 at 11:59
-
Well, the input will be a URL. If the URL is something like http://intosimple.blogspot.com/2011/03/beauty-of-gentoo-installation.html then the task is easier; but if it is http://intosimple.blog.com/ then the task has to be subdivided. – dknight Mar 08 '11 at 11:07
1 Answer
I think you can use this method:
Read the site with urllib (http://docs.python.org/library/urllib2.html?highlight=urllib2#module-urllib2), then remove the tags to get the plain text of the page.
Then check which words are used most often and build a top-ten list (or keep the full counts). A sketch of this is shown below.
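A minimal sketch of the fetch-and-strip step in Python 3 (the linked docs are for Python 2's urllib2; in Python 3 the same functionality lives in urllib.request). The class and function names here (TextExtractor, page_text) are illustrative, not from any library, and the charset fallback to UTF-8 is an assumption:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content of an HTML page, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self._skip = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)

def page_text(url):
    # Fetch the raw HTML and decode it (falls back to UTF-8 if no charset is declared).
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, "replace")
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```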

Mohammad Efazati
-
I am using pycurl to fetch and process the web page. Well, the essence of the question is how to process the plain text obtained from the site. – dknight Mar 08 '11 at 11:09
-
Create a dictionary entry for each word and count its occurrences, {"word": count}, then show the top 10 words. PS: delete the HTML and JS tags first. – Mohammad Efazati Mar 08 '11 at 11:23
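As a sketch of the counting step this comment describes, collections.Counter can stand in for the {"word": count} dictionary. The function name top_keywords and the minimum-length filter are illustrative choices, and it assumes the plain text has already been extracted (e.g. by the page_text sketch above):

```python
import re
from collections import Counter

def top_keywords(text, n=10):
    # Build the {"word": count} mapping the comment describes;
    # Counter is a dict subclass with a convenient most_common() helper.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3)  # crude filter to drop very short words
    return counts.most_common(n)

# Usage (hypothetical), assuming page_text() from the sketch above:
# print(top_keywords(page_text("http://intosimple.blogspot.com/")))
```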