I'm currently using the following method to read in RSS feeds:

    if (!String.IsNullOrEmpty(rawxml) && rawxml.Contains("<rss")) // RSS feeds
    {
        using (StringReader sr = new StringReader(rawxml))
        using (XmlReader xmlReader = XmlReader.Create(sr))
        {
            SyndicationFeed rssfeed = SyndicationFeed.Load(xmlReader);
            // do stuff with the SyndicationFeed rssfeed
        }
    }

This code will handle several different news sources. Because many kinds of errors can occur in SyndicationFeed.Load when the RSS feeds vary, I want to simplify the feed (a string, named rawxml in the code) before I load it into a SyndicationFeed, so that the items in the feed ONLY contain these child elements:

<item>
    <title>*</title>
    <link>*</link>
    <description>*</description>
    <pubDate>*</pubDate>
</item> 

I am currently looking at using a regex pattern to strip out every child element under the <item> elements that isn't a title, link, description or pubDate. I would do this using the following additional code:

  string pattern =  @"some pattern here";
  Regex rgx = new Regex(pattern);
  string result = rgx.Replace(rawxml, "");

The problem is that I am not sure how to write a pattern that removes the unnecessary elements without destroying the child elements I want to keep. Is there a way to select those nested elements? A second strategy I have been looking at is using XPath to select those nodes, but I'm not sure how to remove child nodes from an XmlReader.

UPDATE:

I have decided to move away from regex for now, and I'm looking at using XDocument and XPath to select all the nodes I don't want and remove them from the feed. The following is what I have so far:

    if (!String.IsNullOrEmpty(rawxml) && rawxml.Contains("<rss")) // RSS feeds
    {
        // Create XML and remove unneeded XML nodes
        var xdoc = XDocument.Parse(rawxml);
        foreach (var item in xdoc.XPathSelectElements("//item/??some/xpath/here/to/get/unwanted/children"))
        {
            item.RemoveNodes();
            item.RemoveAll();
        }
        // Feed the cleaned-up XML into SyndicationFeed
        using (XmlReader r = xdoc.CreateReader())
        {
            SyndicationFeed rssfeed = SyndicationFeed.Load(r);
            // Do stuff
        }
    }
Mr Lister
War Gravy

2 Answers


RegEx is not a suitable tool for modifying XML documents. What you're trying to do is a transformation, and there is a standardised technology for transforming XML documents: XSLT. All required types are in the System.Xml.Xsl namespace, and there's also a guide describing how to do an XSL transformation in .NET.
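As a rough illustration of that approach, the sketch below applies an identity transform with one extra empty template that drops every <item> child other than title, link, description and pubDate. The class name RssXsltCleaner and the method name Clean are hypothetical, and the stylesheet assumes plain RSS 2.0 with un-namespaced item children (namespaced children such as media:title are dropped as well). Treat it as a starting point under those assumptions, not a definitive implementation.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper; the names are illustrative, not from the question.
public static class RssXsltCleaner
{
    // Identity transform plus one empty template. The empty template matches
    // any child of <item> that is not title, link, description or pubDate,
    // so those children are simply not copied to the output.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes"" />
  <xsl:template match=""@*|node()"">
    <xsl:copy>
      <xsl:apply-templates select=""@*|node()"" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""item/*[not(self::title or self::link or self::description or self::pubDate)]"" />
</xsl:stylesheet>";

    // Returns the feed XML with the unwanted <item> children removed.
    public static string Clean(string rawxml)
    {
        var transform = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(styleReader);
        }

        using (var input = XmlReader.Create(new StringReader(rawxml)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are needed, so the argument list is null.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

The returned string can then take the place of rawxml in the question's existing StringReader / XmlReader.Create code before it reaches SyndicationFeed.Load.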

jan.h
  • Thank you. I am looking at this (http://stackoverflow.com/questions/34093/how-to-apply-an-xslt-stylesheet-in-c-sharp) for the C# portion, but now I am reading up on what the XSLT is going to look like, because I have never written XSLT before. – War Gravy Mar 10 '16 at 23:51

LINQ and XDocument were more straightforward to use and solved the problem. Here is the solution I used, for anyone coming here who is trying to limit the number of errors they get while reading RSS feeds. I ended up not using SyndicationFeed at all, but those who still want to use it can call .Remove() on the unwanted child elements before loading the feed (note that .RemoveAll() only empties an XElement of its contents and attributes; it leaves the element itself in place).

    // Requires: using System.Linq; using System.Xml.Linq;
    if (!String.IsNullOrEmpty(rawxml) && rawxml.Contains("<rss"))
    {
        // Create XML
        XDocument xdoc = XDocument.Parse(rawxml);
        foreach (var item in xdoc.Descendants("item"))
        {
            // set temporary variables
            foreach (var child in item.Descendants().Where(x =>
                x.Name.ToString().ToLower() == "description" ||
                x.Name.ToString().ToLower() == "link" ||
                x.Name.ToString().ToLower() == "title" ||
                x.Name.ToString().ToLower() == "pubdate"))
            {
                // grab elements with a switch statement
                // do your operations
            }
        }
    }
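For readers who do want to keep SyndicationFeed, the stripping step mentioned above could look roughly like the following sketch. The class name RssLinqCleaner, the method name Clean and the Wanted set are hypothetical names introduced for illustration. The sketch assumes that only un-namespaced title, link, description and pubDate children should survive, so namespaced children such as media:title are removed too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative, not from the answer.
public static class RssLinqCleaner
{
    // Child element names to keep, compared case-insensitively.
    private static readonly HashSet<string> Wanted =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "title", "link", "description", "pubDate"
        };

    // Parses the feed and removes every direct child of each <item>
    // that is namespaced or whose name is not in the Wanted set.
    public static XDocument Clean(string rawxml)
    {
        XDocument xdoc = XDocument.Parse(rawxml);

        // ToList() takes a snapshot so the tree is not modified while
        // the Descendants query is still being enumerated.
        foreach (XElement item in xdoc.Descendants("item").ToList())
        {
            item.Elements()
                .Where(child => child.Name.Namespace != XNamespace.None
                             || !Wanted.Contains(child.Name.LocalName))
                .Remove(); // detaches the matched elements from their parent
        }

        return xdoc;
    }
}
```

The cleaned document can then be handed to SyndicationFeed.Load through xdoc.CreateReader(), as in the question's update.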
War Gravy