Most efficient way to replace text in xml stream

Question

I have a huge chunk of XML data that I need to "clean". The XML looks something like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>       
                    <w:t>F_ck</w:t>
            <!-- -->
                <w:t>F_ck</w:t>
            <!-- -->
                            <w:t>F_ck</w:t>
        </w:p>
    </w:body>
</w:document>

I would like to identify the <w:t>-elements with the value "F_ck" and replace the value with something else. The elements I need to clean will be scattered throughout the document.

I need the code to run as fast as possible and with a memory footprint as small as possible, so I am reluctant to use the XDocument (DOM) approaches I have found here and elsewhere.

The data is given to me as a stream containing the Xml data, and my gut feeling tells me that I need the XmlTextReader and the XmlTextWriter.

My original idea was to do a SAX-mode, forward-only run through the Xml data and "pipe" it over to the XmlTextWriter, but I cannot find an intelligent way to do so.

I wrote this code:

var reader = new StringReader(content);
var xmltextReader = new XmlTextReader(reader);
var memStream = new MemoryStream();
var xmlWriter = new XmlTextWriter(memStream, Encoding.UTF8);

while (xmltextReader.Read())
{
    if (xmltextReader.Name == "w:t")
    {
        //xmlWriter.WriteRaw("blah");
    }
    else
    {
        xmlWriter.WriteRaw(xmltextReader.Value);
    }
}

The code above only writes the value of each node (element text, the declaration, etc.), so the output contains no angle brackets or element names. I realize that I could write code that specifically calls .WriteStartElement(), .WriteEndElement() etc. depending on the NodeType, but I fear that will quickly become a mess.

So the question is:

How do I - in a nice way - pipe the xml data read from the XmlTextReader to the XmlTextWriter while still being able to manipulate the data while piping?

The 'w' is called a prefix and is defined by the namespace : xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main". What are you trying to do? The document doesn't need to be cleaned to de-serialize. — jdweng, Nov 05 '15 at 13:06
@jdweng I know what a namespace is :-) . I am not trying to solve deserialization. I am trying to find the "best" way to replace the values of certain elements in the Xml data. — Jesper Lund Stocholm, Nov 05 '15 at 13:35
Use XDocument (xml linq). Find tags and then simply replace values. — jdweng, Nov 05 '15 at 15:20
@jdweng Yes, I will see if I can get it to work, but as I wrote in the OP, I am reluctant to use XDocument due to its memory footprint. So I'm keeping the post open a bit more in the hope that I can get help on using XmlTextReader/Writer instead :-) — Jesper Lund Stocholm, Nov 05 '15 at 19:05
If you are concerned with speed or memory, try the code at the following website. It processes a 6 MB XML file in a couple of seconds if you download the XML file to local disk. http://stackoverflow.com/questions/33506815/xml-mixed-content-model-with-complex-types-ssis-error/33515451#comment54844570_33515451 — jdweng, Nov 05 '15 at 22:11

score 0 · Answer 1 · answered Nov 05 '15 at 15:20

Try this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string xml =
                "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>" +
                "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
                    "<w:body>" +
                        "<w:p>" +
                                    "<w:t>F_ck</w:t>" +
                            "<!-- -->" +
                                "<w:t>F_ck</w:t>" +
                            "<!-- -->" +
                                            "<w:t>F_ck</w:t>" +
                        "</w:p>" +
                    "</w:body>" +
                "</w:document>";

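            // Load the whole document into memory (DOM-style, not streaming).
            XDocument doc = XDocument.Parse(xml);
            // Use Root rather than casting FirstNode, which fails if a comment precedes the root.
            XElement document = doc.Root;
            XNamespace ns_w = document.GetNamespaceOfPrefix("w");
            // Select only the <w:t> elements whose value is "F_ck"; ToList() snapshots them before editing.
            List<XElement> ts = doc.Descendants(ns_w + "t")
                .Where(t => t.Value == "F_ck")
                .ToList();
            foreach (XElement t in ts)
            {
                t.Value = "abc";
            }

            // Print the modified document to verify the replacement.
            Console.WriteLine(doc);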
    }
}

Why do you all like XDocument? It is extremely slow and memory-hungry — vitalygolub, Nov 05 '15 at 15:43
Is it much better than XmlDocument? XDocument takes fewer instructions and makes it easier to extract tags. — jdweng, Nov 05 '15 at 18:09
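
For comparison, here is a rough sketch of the forward-only XmlReader/XmlWriter pass the question asks about. It copies each node across as it is read and swaps the text only inside matching <w:t> elements, so the whole document is never held in memory. The names XmlStreamCleaner, Clean, target and replacement are made up for illustration. The sketch assumes the text of each <w:t> arrives as a single Text node, that the input has no DTD or entity references, and that the input is UTF-8 (the declaration is copied as-is while the output is always written as UTF-8). Treat it as a starting point rather than a finished implementation.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration; not part of any library.
public static class XmlStreamCleaner
{
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Copies XML from input to output node by node, replacing the text of
    // every <w:t> element whose text equals 'target' with 'replacement'.
    public static void Clean(Stream input, Stream output, string target, string replacement)
    {
        // UTF-8 without a byte order mark; the streams are left open for the caller.
        var writerSettings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };

        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            bool insideT = false; // true while positioned inside a <w:t> element
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        insideT = reader.LocalName == "t" && reader.NamespaceURI == WordNs;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        // Copies all attributes (including xmlns declarations)
                        // and moves the reader back to the element.
                        writer.WriteAttributes(reader, true);
                        if (reader.IsEmptyElement)
                        {
                            writer.WriteEndElement();
                            insideT = false;
                        }
                        break;
                    case XmlNodeType.Text:
                        // The only place where content is changed.
                        writer.WriteString(insideT && reader.Value == target ? replacement : reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        insideT = false;
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
    }
}
```

Calling XmlStreamCleaner.Clean(inputStream, outputStream, "F_ck", "***") should then write the cleaned XML to outputStream. The per-node switch is exactly the "mess" the question worries about, but it is kept in one place and only the Text case contains any replacement logic.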
