I'm writing an algorithm to classify how relevant a page is to a theme such as 'movies', using as much meta information as possible but excluding the textual content of the body.
I want to know what I can use to determine whether a page has information about the theme.
At the moment, I'm giving a weight of 40% to the title, 30% to the part of the link after the domain (the URL path), 20% to the domain and 10% to the meta keywords, but I think I could use more signals to be more precise. I match a list of words, each with its own weight, against these fields to calculate the relevance of the page.
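The weighting described above could be sketched roughly as follows. The field weights come from the description; the term list, the per-term weights, and the names `tokenize`, `field_score` and `relevance` are made up for illustration, and taking the strongest matching term per field is just one possible way to combine matches.

```python
import re
from urllib.parse import urlparse

# Field weights from the description: title 40%, URL path 30%, domain 20%, meta keywords 10%.
FIELD_WEIGHTS = {"title": 0.4, "path": 0.3, "domain": 0.2, "keywords": 0.1}

# Hypothetical vocabulary for the 'movies' theme; the weights are illustrative only.
TERM_WEIGHTS = {"movie": 1.0, "movies": 1.0, "film": 0.8, "cinema": 0.7, "trailer": 0.5}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def field_score(text):
    """Return the weight of the strongest matching term in the field, or 0.0 if none match."""
    return max((TERM_WEIGHTS.get(tok, 0.0) for tok in tokenize(text)), default=0.0)


def relevance(url, title, meta_keywords):
    """Combine the per-field scores into a single relevance value between 0 and 1."""
    parsed = urlparse(url)
    fields = {
        "title": title,
        "path": parsed.path,
        "domain": parsed.netloc,
        "keywords": meta_keywords,
    }
    return sum(FIELD_WEIGHTS[name] * field_score(text) for name, text in fields.items())


if __name__ == "__main__":
    print(relevance("http://example.com/movies/new-trailer",
                    "Latest movie trailers",
                    "film, cinema"))
```

Note that this sketch matches whole tokens only, so a domain such as "moviedb" would not match the term "movie"; substring or stemmed matching would be an alternative.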
Any ideas on what else I can use to calculate the relevance? I only want to exclude the text content inside the HTML itself; the HTML structure can be used.