2

I was wondering how Google Reader extracts news items from a web page.

Does anyone know how it works, or how someone could build a similar system to extract the same information from the HTML of a web page?

Obviously it is not relying on a standard (nor is it only reading RSS/Atom), since Google Reader can evidently read the content of a page regardless of what the markup looks like.

Peter O.
Mo Valipour
  • Google Reader doesn't have the feature you describe. It used to have a "track changes" feature (http://googlereader.blogspot.com/2010/01/follow-changes-to-any-website.html), but it was removed (http://googlereader.blogspot.com/2010/09/turning-off-track-changes-feature.html). – Mihai Parparita Dec 25 '11 at 04:33
  • So why does subscribing to any blog-type web page show the news correctly? e.g. http://jesseliberty.com/ – Mo Valipour Dec 25 '11 at 12:28
  • http://jesseliberty.com/ has an RSS feed, which is signaled by the presence of the <link rel="alternate"> element. When given the URL of a regular page, Google Reader (like other RSS readers) looks for this "autodiscovery" element and subscribes to the feed URL that it points to. – Mihai Parparita Dec 26 '11 at 21:45
  • Thanks Mihai, you are a hero :) – Mo Valipour Dec 26 '11 at 22:14
  • Since they appear to be helpful, I've posted the contents of my comments as an answer. – Mihai Parparita Dec 26 '11 at 22:55

2 Answers

1

Google Reader does not currently do any kind of extraction of content from raw web pages. It used to have a "track changes to arbitrary pages" feature, but that was removed more than a year ago.

When given a URL that is not that of a feed, Google Reader fetches its contents. If the contents are HTML, it looks for an autodiscovery element of the form <link rel="alternate" type="application/atom+xml" href="feed.xml">. If one is found, it subscribes to the feed that the href points to.
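
If you want to do the same autodiscovery step yourself, here is a minimal sketch in Python using only the standard library. It is not Google Reader's actual code, just an illustration of the idea. The names FeedLinkParser and discover_feeds are made up for this example, and accepting application/rss+xml in addition to application/atom+xml is my assumption about what a typical reader would want. It only handles the HTML once you have it; fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds. Atom comes from the answer above;
# RSS is an assumption added for this sketch.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate"> feed elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        if "alternate" in rels and mime in FEED_TYPES and attr.get("href"):
            self.hrefs.append(attr["href"])


def discover_feeds(html, page_url):
    """Return absolute feed URLs advertised by the given HTML page."""
    parser = FeedLinkParser()
    parser.feed(html)
    # href may be relative, so resolve it against the page's own URL.
    return [urljoin(page_url, href) for href in parser.hrefs]


page = ('<html><head><link rel="alternate" type="application/atom+xml" '
        'href="feed.xml"></head><body></body></html>')
print(discover_feeds(page, "http://example.com/blog/"))
# prints ['http://example.com/blog/feed.xml']
```

If the returned list is empty, the page does not advertise a feed, and a reader that works this way has nothing to subscribe to.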

Mihai Parparita
-2

You already answered your question by tagging your question with "RSS".

Anyway, Google Reader, like all other RSS/Atom readers, reads an RSS or Atom feed. You may want to have a look at the corresponding Wikipedia article: http://en.wikipedia.org/wiki/RSS

radkappe
  • This is not right. Google Reader also reads content from HTML pages, and that is the subject of this question. The RSS tag was added to get the attention of people interested in RSS. – Mo Valipour Dec 20 '11 at 23:23
  • Oh, sorry! I wasn't aware of that feature (which apparently only works in the English version and for English pages). But that would probably also mean they somehow parse the sentences/words on the site itself... – radkappe Dec 20 '11 at 23:30
  • it works for other languages (as far as I'm working with it) as well ;) – Mo Valipour Dec 20 '11 at 23:32