Given a URL, the URL of the page it appears on, that page's DOM, and a list of the other URLs on the page, how can I reliably determine whether the URL is in the page's header, in its footer, or in neither?

I'm using C#/.NET.

I know that no solution is perfect, since web pages are not marked up semantically and some sites deliberately obfuscate their pages, but I would like to build some logic that works for, say, 75% of web pages.

Also, are there other pieces of information that would be helpful to determine the location of the URL in the page?

Chad

1 Answer

I think the creative task here is to define "header" and "footer", for example "content less than x units away from the top" or "the last 200 characters on the page". Once you have done that, you can parse the page based on those rules.
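
As a rough illustration of such a rule, here is a minimal sketch that uses "position in the list of links" as the unit of distance. It assumes you already have every link URL on the page in document order; the names `LinkRegion`, `LinkOrderHeuristic`, and the `edgeCount` threshold are hypothetical and would need tuning against real pages.

```csharp
using System;
using System.Collections.Generic;

public enum LinkRegion { Header, Footer, Neither }

public static class LinkOrderHeuristic
{
    // linksInDocumentOrder: every link URL on the page, in the order it appears in the DOM.
    // edgeCount: hypothetical threshold, i.e. how many links at each end count as header/footer.
    public static LinkRegion Classify(
        IReadOnlyList<string> linksInDocumentOrder, string url, int edgeCount = 10)
    {
        int first = -1, last = -1;
        for (int i = 0; i < linksInDocumentOrder.Count; i++)
        {
            if (string.Equals(linksInDocumentOrder[i], url, StringComparison.Ordinal))
            {
                if (first < 0) first = i;   // remember the earliest occurrence
                last = i;                   // and the latest one
            }
        }

        if (first < 0) return LinkRegion.Neither;         // URL does not occur on the page
        if (first < edgeCount) return LinkRegion.Header;  // among the first edgeCount links
        if (last >= linksInDocumentOrder.Count - edgeCount)
            return LinkRegion.Footer;                     // among the last edgeCount links
        return LinkRegion.Neither;
    }
}
```

Note that on a page with fewer than `2 * edgeCount` links the two windows overlap and the header check wins, so a real rule would probably scale the threshold with the number of links.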

cdonner
  • Yeah, that's exactly what the question is asking for... heuristics (one of the question's tags) to label a URL as being in the header or footer. I know I need to define these very broad ideas. I'm looking for everything from simple (e.g. one of the first x links on a page) to very complex (backtracking in the DOM looking for containers that look like headers and footers). I would like to emphasize simple heuristics, as I'm aiming for 75% of sites. That 75% is what I consider well-behaved pages. I'm not going to spend 90% of my time on the other 25% of pages. Thanks. – Chad Jul 21 '10 at 04:55
  • Furthermore, I want "header" and "footer" to be what you typically consider a header and footer on a webpage. It tends to be obvious when you look at a page, but obviously not immediately apparent when just looking at the HTML of a page. This is part of the challenge of the question, I want to try to identify heuristics that can tag a URL as being in the header/footer. **I don't want to constrain the idea of a header/footer, rather I want to adapt to each page as best as possible**. – Chad Jul 21 '10 at 19:06
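
The more complex end of that range, backtracking in the DOM to look for containers that look like headers or footers, might be sketched as follows. This is only an illustration under strong assumptions: the page is taken to be well-formed XHTML already loaded into an `XElement` (most real HTML is not well-formed XML, so a real implementation would need an HTML parser such as Html Agility Pack first), and the hint words "header" and "footer" are a guess at common naming, not a verified list. `ContainerHeuristic` is a hypothetical name.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ContainerHeuristic
{
    // Returns "header", "footer", or "neither" for the first <a> whose href equals url.
    public static string Classify(XElement root, string url)
    {
        XElement anchor = root.DescendantsAndSelf().FirstOrDefault(e =>
            e.Name.LocalName.Equals("a", StringComparison.OrdinalIgnoreCase) &&
            (string)e.Attribute("href") == url);
        if (anchor == null) return "neither";   // link not found in the DOM

        // Walk from the nearest parent outward, so the innermost matching container wins.
        foreach (XElement ancestor in anchor.Ancestors())
        {
            // Tag name, id and class are the only hints inspected here (an assumption).
            string hints = string.Join(" ",
                ancestor.Name.LocalName,
                (string)ancestor.Attribute("id"),
                (string)ancestor.Attribute("class"));

            if (hints.IndexOf("header", StringComparison.OrdinalIgnoreCase) >= 0) return "header";
            if (hints.IndexOf("footer", StringComparison.OrdinalIgnoreCase) >= 0) return "footer";
        }
        return "neither";
    }
}
```

Pages that give no such hints in their tag names, ids, or classes fall through to "neither", which is where a positional rule could serve as a fallback.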