3

I want to detect whether a feed has changed. The only way I can think of is to hash the contents of the XML document and compare that to the last hash of the feed.

I am using XmlReader because SyndicationFeed uses it, so ideally I don't want to load the syndication feed unless the feed has been updated.

XmlReader reader = XmlReader.Create("http://www.extremetech.com/feed");
SyndicationFeed feed = SyndicationFeed.Load(reader);
superlogical
  • What are the consequences of a hash collision? That is, suppose two documents have the same hash. What's the worst thing that can happen? – Eric Lippert Oct 24 '11 at 19:09
  • I did some more tests. If this is your exact feed, there are some comments in it that change periodically even though the non-comment XML tags never change, so I don't think a hash approach is going to work at all – MerickOWA Oct 24 '11 at 19:32
  • @MerickOWA I think I will just go with using the ID that is in the SyndicationItem.. might be easier :) And that way if the feed title or article is edited it won't be a problem! – superlogical Oct 24 '11 at 19:42
  • @superlogical I added another possibility which doesn't rely on hashing and which should probably work in general, though it is dependent on the server. – MerickOWA Oct 24 '11 at 20:08

3 Answers

3

Why not just check the LastUpdatedTime of the feed? That's a built-in way of telling you whether something is new or not. Instead of hashing and storing a hash, you would simply keep track of the LastUpdatedTime and periodically compare it to the latest LastUpdatedTime:

using System;
using System.ServiceModel.Syndication;
using System.Xml;

public class MyClass
{
    private static DateTime _lastFeedTime = new DateTime(2011, 10, 10);

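    public static void Main()
    {
        // Dispose the reader (and its underlying stream) once the feed is loaded.
        using (XmlReader reader = XmlReader.Create("http://www.extremetech.com/feed"))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);

            // LastUpdatedTime is a DateTimeOffset; compare it in local time.
            if (feed.LastUpdatedTime.LocalDateTime > _lastFeedTime)
            {
                _lastFeedTime = feed.LastUpdatedTime.LocalDateTime;

                // the feed has changed: process it here...
            }
        }
    }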
}
Paul Sasik
  • Yeah I considered that, but I just don't know how reliable that will be considering some feeds might not update that value. But then again I could be totally wrong :) Does Wordpress always play nice with that? The majority of the feeds I want to index will be Wordpress based – superlogical Oct 24 '11 at 18:48
  • The LastUpdatedTime is completely unreliable because it depends on the server cooperating. – usr Oct 24 '11 at 18:52
  • Give the DateTime thing a try first. Don't assume and solve a problem unless you have to. And yes, you depend on a 3rd party conforming to a standard but that happens all the time. And I can't think of a more useful piece of metadata than LastUpdatedTime to comply with. Non-compliance should result in physical punishment. ;-) – Paul Sasik Oct 24 '11 at 18:58
  • @PaulSasik I think I will just go with using the ID that is in the SyndicationItem – superlogical Oct 24 '11 at 19:43
3

If you really want to go the hash way you can do the following:

var client = new WebClient();

var content = client.DownloadData("http://www.extremetech.com/feed");

var hash = MD5.Create().ComputeHash(content);
var hashString = Convert.ToBase64String(hash);

// you can then compare hashes and if changed load it this way
XmlReader reader = XmlReader.Create(new MemoryStream(content));

Of course going this way you will detect any change in the content, even the slightest.

IMHO the best way to go is to load the feed anyway and hash just the contents of the articles. You can hash any string like this:

var toHash = "string to hash";

var hash = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(toHash));
var hashString = Convert.ToBase64String(hash);
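
To tie this back to the feed, here is a rough sketch of hashing only the article content of an already loaded SyndicationFeed. The class name `FeedHasher`, the helper `ComputeItemsHash`, and the choice of fields (item Id, title and summary) are my own assumptions for illustration; pick whichever fields matter for your notion of "changed".

```csharp
using System;
using System.Security.Cryptography;
using System.ServiceModel.Syndication;
using System.Text;

public static class FeedHasher
{
    // Hypothetical helper: hashes only per-item fields, so XML comments or
    // whitespace elsewhere in the document do not affect the result.
    public static string ComputeItemsHash(SyndicationFeed feed)
    {
        var builder = new StringBuilder();
        foreach (SyndicationItem item in feed.Items)
        {
            // Title and Summary may be null for some feeds; Append(null) adds nothing.
            builder.Append(item.Id).Append('\n');
            builder.Append(item.Title?.Text).Append('\n');
            builder.Append(item.Summary?.Text).Append('\n');
        }

        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(builder.ToString()));
            return Convert.ToBase64String(hash);
        }
    }
}
```

You would store the returned string and compare it with the value from the previous poll; a different string means at least one of the hashed fields changed, or items were added, removed or reordered.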

Hope this helps.

Maghis
2

A hash approach won't work in this case because of an XML comment, added by some server-side caching, which changes very frequently even when the actual feed never changes.

One thing you can do which works for this feed is use HTTP conditional requests to ask the server to give you the data only if it's actually been modified since the last time you requested it.

For example:

You'd have a global/member variable to hold the last-modified date and time from your feed:

    var lastModified = DateTime.MinValue;

Then each time you'd make a request like the following:

    var request = (HttpWebRequest)WebRequest.Create("http://www.extremetech.com/feed");
    request.IfModifiedSince = lastModified;
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            lastModified = response.LastModified;

            using (var stream = response.GetResponseStream())
            using (var reader = XmlReader.Create(stream))
            {
                // The feed has changed: parse the stream
                SyndicationFeed feed = SyndicationFeed.Load(reader);
            }
        }
    }
    catch (WebException e)
    {
        // A 304 Not Modified surfaces as a WebException.
        // e.Response is null when no HTTP response was received at all.
        var response = e.Response as HttpWebResponse;
        if (response == null || response.StatusCode != HttpStatusCode.NotModified)
            throw; // rethrow an unexpected web exception
    }
MerickOWA
  • +1 for using HTTP properly. You can also use the Expires header in the response (if it's there) and the metadata in the feed (last update date, update period and update frequency) to guide you as to when/how often you should next check for updates. – Nicholas Carey Oct 24 '11 at 20:12