
I'm working with the Instapaper API.

I'm using this call to pull the content of the article:

$Bookmark_Text = $connection->getBookmarkText($Bookmark['bookmark_id']);

Unfortunately, it returns the entire HTML document, so a complete HTML structure ends up nested inside my own page.

Example:

<html>
<head></head>
<body>
    <html>
    <head>Instapaper Title</head>
    <body>InstaPaper Article Content</body>
    </html>
</body>
</html>

Any thoughts on how to get just the "Instapaper Article Content"?

Thanks!

Chris Olson

2 Answers

Here’s some JS code that extracts only the article and removes Instapaper’s stuff (top and bottom bar for example).

var article = html.replace(/^[\s\S]*<div id="story">|<\/div>[^<]*<div class="bar bottom">[\s\S]*$/gim, '');

Be aware that it may change as Instapaper’s HTML output changes.
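Because the markup can change, it may be safer to detect when the markers are missing instead of silently returning the whole page. The following is only a sketch of that idea: `extractStory` is a hypothetical helper name, and it assumes the same `<div id="story">` and `<div class="bar bottom">` markers used in the regex above are still present in Instapaper's output.

```javascript
// Hypothetical helper (not part of any Instapaper library).
// Assumes the article sits between <div id="story"> and the closing
// </div> that precedes <div class="bar bottom">.
function extractStory(html) {
  // Locate the opening marker of the article container.
  var open = /<div id="story">/i.exec(html);
  if (!open) {
    return null; // marker missing: the markup has probably changed
  }

  // Everything after the opening marker.
  var rest = html.slice(open.index + open[0].length);

  // Locate the closing </div> that is followed by the bottom bar.
  var close = /<\/div>[^<]*<div class="bar bottom">/i.exec(rest);
  if (!close) {
    return null;
  }

  // Keep only the markup between the two markers.
  return rest.slice(0, close.index).trim();
}
```

A `null` result then signals that the regex needs updating, which is easier to notice than a page that quietly includes the full Instapaper layout.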

drkbrd

Use a parser to extract the contents of <body>. PHP has some built in, but there are others out there which might be easier to use.

This should do it if $Bookmark_Text is a valid HTML document.

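$dom = new DOMDocument();
// Collect parse warnings internally instead of emitting them for malformed markup.
libxml_use_internal_errors(true);
$dom->loadHTML($Bookmark_Text);
$body = $dom->getElementsByTagName('body')->item(0);
// Serialize only the <body> node (the node argument requires PHP 5.3.6 or later).
$content = $body->ownerDocument->saveHTML($body);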
freejosh
  • None of these seem to be able to pull just everything in the body. – Chris Olson May 19 '12 at 00:54
  • Are you sure the HTML in your example is exactly what's returned by the API? I was able to create an example using `DOMDocument`, but because the `<head>` has text in it, that's parsed as a `<p>` and put into the body. – freejosh May 19 '12 at 02:33
  • Added my code to the answer. If the returned document isn't valid HTML, your only choice may be to try a regular expression. – freejosh May 19 '12 at 02:41