
I have a huge chunk of XML data that I need to "clean". The XML looks something like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>       
                    <w:t>F_ck</w:t>
            <!-- -->
                <w:t>F_ck</w:t>
            <!-- -->
                            <w:t>F_ck</w:t>
        </w:p>
    </w:body>
</w:document>

I would like to identify the <w:t>-elements with the value "F_ck" and replace the value with something else. The elements I need to clean will be scattered throughout the document.

I need the code to run as fast as possible and with a memory footprint as small as possible, so I am reluctant to use the XDocument (DOM) approaches I have found here and elsewhere.

The data is given to me as a stream containing the Xml data, and my gut feeling tells me that I need the XmlTextReader and the XmlTextWriter.

My original idea was to do a SAX-mode, forward-only run through the Xml data and "pipe" it over to the XmlTextWriter, but I cannot find an intelligent way to do so.

I wrote this code:

var reader = new StringReader(content);
var xmltextReader = new XmlTextReader(reader);
var memStream = new MemoryStream();
var xmlWriter = new XmlTextWriter(memStream, Encoding.UTF8);

while (xmltextReader.Read())
{
    if (xmltextReader.Name == "w:t")
    {
        //xmlWriter.WriteRaw("blah");
    }
    else
    {
        xmlWriter.WriteRaw(xmltextReader.Value);
    }
}

The code above only writes the Value of each node (the contents of the XML declaration, text, and so on), so no angle brackets, element names or attributes end up in the output. I realize that I could write code that specifically calls .WriteStartElement(), .WriteEndElement() etc. depending on the NodeType, but I fear that will quickly become a mess.

So the question is:

How do I - in a nice way - pipe the xml data read from the XmlTextReader to the XmlTextWriter while still being able to manipulate the data while piping?

Jesper Lund Stocholm
  • The 'w' is called a prefix and is defined by the namespace : xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main". What are you trying to do? The document doesn't need to be cleaned to de-serialize. – jdweng Nov 05 '15 at 13:06
  • @jdweng I know what a namespace is :-) . I am not trying to solve deserialization. I am trying to find the "best" way to replace the values of certain elements in the Xml data. – Jesper Lund Stocholm Nov 05 '15 at 13:35
  • Use XDocument (xml linq). Find tags and then simply replace values. – jdweng Nov 05 '15 at 15:20
  • @jdweng Yes, I will see if I can get it to work, but as I wrote in the OP, I am reluctant to use XDocument due to its memory footprint. So I'm keeping the post open a bit longer in the hope that I can get help on using XmlTextReader/Writer instead :-) – Jesper Lund Stocholm Nov 05 '15 at 19:05
  • If you are concerned with speed or memory try the code at following website. It is a 6MByte XML file that runs in a couple of seconds if you download the xml file to local disk. http://stackoverflow.com/questions/33506815/xml-mixed-content-model-with-complex-types-ssis-error/33515451#comment54844570_33515451 – jdweng Nov 05 '15 at 22:11

1 Answer


Try this

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string xml =
                "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>" +
                "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
                    "<w:body>" +
                        "<w:p>" +
                                    "<w:t>F_ck</w:t>" +
                            "<!-- -->" +
                                "<w:t>F_ck</w:t>" +
                            "<!-- -->" +
                                            "<w:t>F_ck</w:t>" +
                        "</w:p>" +
                    "</w:body>" +
                "</w:document>";

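            XDocument doc = XDocument.Parse(xml);
            // doc.Root is the document element even when comments precede it
            XElement document = doc.Root;
            XNamespace ns_w = document.GetNamespaceOfPrefix("w");
            // Select only the <w:t> elements whose text is the unwanted value
            List<XElement> ts = doc.Descendants(ns_w + "t").Where(e => e.Value == "F_ck").ToList();
            foreach (XElement t in ts)
            {
                t.Value = "abc";
            }

            // Print the cleaned document
            Console.WriteLine(doc);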

        }
    }
}
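
If the XDocument footprint does turn out to be a problem, a forward-only copy is also possible. The following is only a minimal sketch of that idea, not part of the code above and not tested against real Word documents: `StreamingCleaner` and `Clean` are hypothetical names, and the sketch assumes the text of each `<w:t>` arrives as a single text node and that the input has no DTD or entity references. It copies each node from an `XmlReader` to an `XmlWriter`, which is roughly what `XmlWriter.WriteNode` does internally, and swaps the text only when the reader is inside a `<w:t>` element.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class StreamingCleaner
{
    // WordprocessingML namespace taken from the sample document.
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: copies XML from input to output node by node and
    // replaces the text of every <w:t> element whose value equals 'bad'.
    public static void Clean(Stream input, Stream output, string bad, string replacement)
    {
        // UTF-8 without a byte order mark; the output stream is left open.
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };

        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            bool insideT = false; // true while positioned between <w:t> and </w:t>

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // Capture state before the attributes are copied.
                        bool isEmpty = reader.IsEmptyElement;
                        insideT = !isEmpty && reader.LocalName == "t" && reader.NamespaceURI == W;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    }
                    case XmlNodeType.EndElement:
                        insideT = false;
                        writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.Text:
                        // The only place where the data is manipulated.
                        writer.WriteString(insideT && reader.Value == bad ? replacement : reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                }
            }
        }
    }
}
```

It would be called as `StreamingCleaner.Clean(inputStream, outputStream, "F_ck", "abc")`. Only the current node is held in memory, so the footprint should stay roughly constant regardless of document size, at the cost of handling each node type by hand.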
jdweng