I'm writing an algorithm to classify how relevant a page is to a theme such as 'movies', using as much meta information as possible but excluding the textual content of the body.

I want to know what I can use to determine whether a page has information about the theme.

At the moment, I give a weight of 40% to the title, 30% to the URL path (the part of the link after the domain), 20% to the domain and 10% to the meta keywords, but I think I could use more signals to be more precise. I match a list of weighted words against each of these fields to calculate the relevance of the page.
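To make the current scheme concrete, here is a minimal sketch of that weighted match in Python. The field weights are the ones given above; the word list `THEME_WORDS`, its values, and the function names are made up for illustration, and taking the strongest matching word per field is only one possible way to combine matches.

```python
import re
from urllib.parse import urlparse

# Field weights from the question: title 40%, path 30%, domain 20%, keywords 10%.
FIELD_WEIGHTS = {"title": 0.40, "path": 0.30, "domain": 0.20, "keywords": 0.10}

# Hypothetical weighted word list for the theme "movies".
THEME_WORDS = {"movie": 1.0, "movies": 1.0, "film": 0.8, "cinema": 0.6}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def field_score(text):
    """Return the strongest theme-word weight found in text (0.0 if none)."""
    return max((THEME_WORDS.get(tok, 0.0) for tok in tokenize(text)), default=0.0)


def relevance(url, title, meta_keywords):
    """Combine the per-field scores into one relevance value between 0 and 1."""
    parsed = urlparse(url)
    fields = {
        "title": title,
        "path": parsed.path,
        "domain": parsed.netloc,
        "keywords": meta_keywords,
    }
    return sum(FIELD_WEIGHTS[name] * field_score(value) for name, value in fields.items())


score = relevance("http://example.com/movies/top", "Best movies of 2011", "film, cinema")
print(round(score, 2))
```

Any new signal can be added as another entry in `FIELD_WEIGHTS`, as long as the weights are rebalanced so they still sum to 1.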

Any ideas on what else I can use to calculate the relevance? I only want to exclude the text content inside the HTML itself; the HTML structure can be used.

Renato Dinhani
  • Nowadays a number of sites use [Dublin Core](http://dublincore.org/) based headers (meta tags). Maybe this helps? – home Sep 03 '11 at 13:29
  • Your question title asks one thing (page relevance) but the question body asks another (page theme/category). Do you want to classify whether a web page belongs to a category? Can you look at the anchor texts of links? – Felipe Hummel Sep 04 '11 at 00:14
  • @Felipe I edited the title. I want the relevance for some theme: the relevance of a page for movies, or music, or games, or IT, etc. By meta information, I mean everything that is not the content of the page itself (like this message). This is because the page can contain many things in different contexts, like my question, the answer, the related questions, the advertisements, etc. As for the anchors, that looks like a good idea; I will think about it. Thanks! – Renato Dinhani Sep 04 '11 at 02:23
  • @home Thanks for your idea, I will take a closer look at this, but I think not many pages use it, right? – Renato Dinhani Sep 04 '11 at 02:24

1 Answer

I think you should consider the main menu links and, where present, the submenu links; in short, links. You should also take the metadata into account. But I'm still not sure what you are trying to achieve.

From what I understood, you are trying to build some kind of "relevancy" formula for a web page.
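As a rough sketch of the links idea, the following Python uses the standard library's `html.parser` to collect every `<a>` element and report what fraction of the links mention a theme word in their `href` or anchor text. The names `LinkCollector` and `link_theme_ratio` are hypothetical, the substring match is deliberately crude, and restricting it to menu links would need extra logic that depends on each site's markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor_text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, or None
        self._text = []     # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def link_theme_ratio(html, theme_words):
    """Return the fraction of links whose href or anchor text contains a theme word."""
    parser = LinkCollector()
    parser.feed(html)
    if not parser.links:
        return 0.0
    hits = sum(
        1
        for href, text in parser.links
        if any(word in (href + " " + text).lower() for word in theme_words)
    )
    return hits / len(parser.links)


menu = '<ul><li><a href="/movies">Movies</a></li><li><a href="/about">About us</a></li></ul>'
print(link_theme_ratio(menu, {"movie"}))
```

The resulting ratio could be used as one more weighted field in the relevance formula.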

Florin