
What I'm looking at doing is essentially the same thing a Tweet button or Facebook Share/Like button does: scrape a page and find the most relevant title for a piece of data. The best example I can think of is when you're on the front page of a website with many articles and you click a Facebook Like button: it then picks up the proper information for the post nearest the button. Some sites have Open Graph tags, but some do not, and it still works.

Since this is done remotely, I only have control over the data that I want to target; in this case, the data are images. Rather than retrieving just the <title> of the page, I am looking to somehow traverse the DOM in reverse from the starting point of each image and find the nearest "title". The problem is that not all titles occur before an image; however, the chance of an image occurring after its title seems fairly high. With that said, my hope is to make this work well for nearly any site.
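
Something like this rough sketch is what I have in mind (assume $html already holds the fetched markup; the h1-h4 whitelist is just an example):

$dom = new DOMDocument();
libxml_use_internal_errors(true); // remote HTML is rarely well-formed
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//img[@src]') as $img)
{
    // every h1-h4 that precedes this image, in document order,
    // so the last item in the list is the nearest one
    $headings = $xpath->query('preceding::h1|preceding::h2|preceding::h3|preceding::h4', $img);

    if ($headings->length > 0)
    {
        echo $img->getAttribute('src'), ' => ', trim($headings->item($headings->length - 1)->textContent), "\n";
    }
}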

Thoughts:

  • Find the "container" of the image and then use the first block of text.
  • Find the blocks of text in elements that contain certain classes ("description", "title") or elements (h1,h2,h3,h4).

Title backups (a rough sketch follows this list):

  • Using Open Graph Tags
  • Using just the <title>
  • Using ALT tags only
  • Using META Tags
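
As a rough sketch, that backup chain could look something like this (the function name and query order are placeholders, and the ALT fallback is per-image, so it would live in the image loop instead):

function pageTitleFallback(DOMXPath $xpath)
{
    $queries = array(
        '//meta[@property="og:title"]/@content', // Open Graph
        '//title',                               // plain <title>
        '//meta[@name="description"]/@content',  // generic META tags
    );

    foreach ($queries as $query)
    {
        $nodes = $xpath->query($query);

        if (($nodes !== false) && ($nodes->length > 0))
        {
            $value = trim($nodes->item(0)->textContent);

            if ($value !== '')
            {
                return $value;
            }
        }
    }

    return null;
}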

Summary: Extracting the images isn't the problem; getting relevant titles for them is.

Question: How would you go about getting relevant titles for each of the images? Perhaps using DOMDocument or XPath?

stwhite
  • Honestly, after you scrape it with PHP, if you could hand it off via REST calls to a small Java Web server, you could use JSOUP to easily get access to all of those elements and attributes. JSOUP is like jQuery for Java and uses much the same syntax. I wish it was available for PHP as it would make your problem go away in seconds! – jamesmortensen May 19 '12 at 18:33
  • There are several libraries available that deal with content extraction from pages, although I don't know of one that deals directly with images. But you might get some ideas and directions, or be able to use them. Here's one: http://code.google.com/p/boilerpipe/wiki/Components – Not_a_Golfer May 19 '12 at 18:34
  • Thanks for your thoughts. I've updated my question to target more of the "logic" behind getting relevant titles or descriptions for each image rather than how to get the images themselves. – stwhite May 19 '12 at 18:37
  • @stwhite I built an image search engine once, and even though the logic is not the same, what I did was index (besides ALT, TITLE, etc.) the text within a reasonable distance in the DOM of that image (I wanted different texts for various images in the page). It worked rather well. I don't remember the exact heuristics, but the general idea was that the closer the chunk of text is to the image, the more relevant it was (a rough sketch of this idea appears after these comments). – Not_a_Golfer May 19 '12 at 18:42
  • @Not_a_Golfer That's exactly what I was just thinking: essentially giving a score for its relation to the image's exact position. Do you remember if you weighted on certain tags (h1, h2, h3, h4, h5, p) or classes on tags? – stwhite May 19 '12 at 18:49
  • @stwhite I don't remember, but as far as I recall the main factor was distance to the image. But again, I didn't need to display text, just to make the image findable and relevant to keywords. – Not_a_Golfer May 19 '12 at 19:06
  • I think it might be nice to consider not only distance as in length of the node path, but distance as in pixels, because of absolute and relative positioning. – goat May 19 '12 at 19:39
  • @chris how would you propose doing that considering you won't physically be seeing the remote page? Is it even possible? – stwhite May 19 '12 at 19:48
  • @stwhite Execute a web browser via command line, telling it to load the given URL, so that it fully recreates the DOM structure and loads all CSS etc. From that point, it's straightforward JavaScript to find the pixel coordinates of any DOM element. I don't know the easiest way to run your own JavaScript code after the page loads, but worst case you could write a small browser extension that just waits for the page to load and then injects your script. There are a lot of cool possibilities when harnessing the processing power of a real web browser. – goat May 19 '12 at 19:55
  • @Chris Just a thought, but I'd think that having to load all of the resources would be fairly slow, wouldn't it? However, it seems that the best way to check distance would be to do it visually... – stwhite May 20 '12 at 20:49
  • Yeah, it would be a pretty large overhead. You only need to process it once, though: just save the coords of text containers and images, and you can calc distance on the fly easily after you save positions. – goat May 20 '12 at 20:52
  • @Chris Do you have any idea on how to physically load the page? My goal is to make this fast, a lot like Facebook's way of posting status updates. If there was a way of physically loading the page with cURL and then performing JavaScript calculations, that would work... I'm just not sure you can do that with cURL. – stwhite May 20 '12 at 21:43
  • You need a web browser, and you can execute web browsers programmatically. – goat May 20 '12 at 21:46
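
Pulling the comment thread together, the tree-distance idea might look like this in plain PHP (domDistance is a hypothetical helper, not an existing API, and pixel distance, as goat notes, would require driving a real browser):

function domDistance(DOMNode $a, DOMNode $b)
{
    // record the depth of every ancestor of $a, keyed by its unique node path
    $depths = array();
    $depth  = 0;

    for ($node = $a; $node !== null; $node = $node->parentNode)
    {
        $depths[$node->getNodePath()] = $depth++;
    }

    // climb from $b until we reach the lowest common ancestor
    $steps = 0;

    for ($node = $b; $node !== null; $node = $node->parentNode)
    {
        $path = $node->getNodePath();

        if (isset($depths[$path]) === true)
        {
            return $steps + $depths[$path]; // hops up plus hops down
        }

        $steps++;
    }

    return PHP_INT_MAX; // nodes from different documents
}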

1 Answer


Your approach seems good enough. I would just give certain tags / attributes a weight and loop through them with XPath queries until I find something that exists and isn't empty. Something like:

i = 1  // XPath positions are 1-based

while ((//img)[i][@src])
  if ((//img)[i][@alt])
    title[i] = alt
  else if ((//img)[i][@title])
    title[i] = title attribute
  else if ((//img)[i]/../p[1])
    title[i] = first paragraph in the image's container
  else
    title[i] = //title

  i++

A simple XPath example (function ported from my framework):

function ph_DOM($html, $xpath = null)
{
    if (is_object($html) === true)
    {
        // already a SimpleXMLElement; just run the query
        if (isset($xpath) === true)
        {
            $html = $html->xpath($xpath);
        }

        return $html;
    }

    else if (is_string($html) === true)
    {
        $dom = new DOMDocument();

        // silence libxml parse warnings from malformed real-world HTML
        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }

        // mb_html_entities() is part of the author's own framework; per its
        // name, it encodes multi-byte characters as entities before parsing
        if ($dom->loadHTML(ph()->Text->Unicode->mb_html_entities($html)) === true)
        {
            return ph_DOM(simplexml_import_dom($dom), $xpath);
        }
    }

    return false;
}

And the actual usage:

$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');

print_r(ph_DOM($html, '//img')); // gets all images
print_r(ph_DOM($html, '//img[@src]')); // gets all images that have a src
print_r(ph_DOM($html, '//img[@src]/..')); // gets all images that have a src and their parent element
print_r(ph_DOM($html, '//img[@src]/../..')); // and so on...
print_r(ph_DOM($html, '//title')); // get the title of the page
Alix Axel
  • I've been reading about XPath and actually started testing some options, but can you expand on this? Finding the distance between nodes seems like a good idea, but I haven't come up with a solution just yet. – stwhite May 19 '12 at 21:02
  • @stwhite: Actually that was not my idea, you just start at the highest level of specificity (`img` tag) and work your way up, until you find something that you'd want to consider as descriptive. – Alix Axel May 20 '12 at 16:15
  • I realize this wasn't your initial idea, but do you have any ideas on how to get the distance between found nodes? For example, finding the position of the current image to a preceding H1 versus the distance from the image to a preceding h2. This would seemingly give a score of which is more likely to be a "better" title. Essentially it would really be about which came first or which is closer to the image. – stwhite May 20 '12 at 20:42
  • @stwhite: Just count the number of `/..`? Actually, I think the description can come before *and after* the image, you might wanna take a look at http://www.w3schools.com/xpath/xpath_syntax.asp and http://www.w3schools.com/xpath/xpath_axes.asp, namely `preceding` and `following`. – Alix Axel May 20 '12 at 21:59
  • I'm aware of preceding and following and have written a system for retrieving a series of elements, but the problem of just counting '/..' doesn't account for relative index position to the parent that may also contain an h1,h2. I'm essentially trying to find the Lowest Common Ancestor to help index from: http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor – stwhite May 20 '12 at 22:20
  • @stwhite: I don't see how that's going to help your objective, IMO the closer your element is to the image the more descriptive it'll be. Take that Wikipedia page for instance... Either way, if you're going down that road, you'll probably need to map each tag[index] => children and run BFS or similar to get the number of "jumps". But honestly, I'm not following... Common to what? Perhaps a dummy example is in order. – Alix Axel May 21 '12 at 01:18
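
Taking up that last suggestion, here is a dummy example of the scoring loop, reusing the hypothetical domDistance() helper sketched under the question's comments (the candidate list is illustrative, and per-tag weights are left out for brevity):

function nearestTitle(DOMXPath $xpath, DOMElement $img)
{
    $best     = null;
    $bestDist = PHP_INT_MAX;

    // candidate "titles": headings and paragraphs anywhere in the page
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//p') as $candidate)
    {
        $text = trim($candidate->textContent);

        if ($text === '')
        {
            continue;
        }

        $dist = domDistance($img, $candidate);

        if ($dist < $bestDist)
        {
            $bestDist = $dist;
            $best     = $text;
        }
    }

    return $best;
}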