I have a huge chunk of XML data that I need to "clean". The XML looks something like this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:t>F_ck</w:t>
      <!-- -->
      <w:t>F_ck</w:t>
      <!-- -->
      <w:t>F_ck</w:t>
    </w:p>
  </w:body>
</w:document>
I would like to identify the <w:t> elements with the value "F_ck" and replace the value with something else. The elements I need to clean will be scattered throughout the document.
I need the code to run as fast as possible and with as small a memory footprint as possible, so I am reluctant to use the XDocument (DOM) approaches I have found here and elsewhere.
The data is given to me as a stream containing the XML, and my gut feeling tells me that I need the XmlTextReader and the XmlTextWriter.
My original idea was to do a SAX-style, forward-only pass through the XML data and "pipe" it over to the XmlTextWriter, but I cannot find an intelligent way to do so.
I wrote this code:
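using System.IO;
using System.Text;
using System.Xml;

var reader = new StringReader(content);
var xmltextReader = new XmlTextReader(reader);
var memStream = new MemoryStream();
var xmlWriter = new XmlTextWriter(memStream, Encoding.UTF8);
while (xmltextReader.Read())
{
    // Name is "w:t" on the element's start and end tags, not on its text node.
    if (xmltextReader.Name == "w:t")
    {
        //xmlWriter.WriteRaw("blah");
    }
    else
    {
        // Value holds only the node's content (text, comment body, ...), never the markup.
        xmlWriter.WriteRaw(xmltextReader.Value);
    }
}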
The code above only copies the Value of each node (the XML declaration, text and so on), so the output contains no angle brackets or element names. I realize that I could write code that specifically calls .WriteStartElement(), .WriteEndElement() etc. depending on the NodeType, but I fear that will quickly become a mess.
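To make the per-NodeType idea concrete, here is a rough sketch of what I imagine that dispatch would look like. It is only an illustration under assumptions: XmlPipe and Clean are hypothetical names, it uses XmlReader.Create and XmlWriter.Create instead of XmlTextReader and XmlTextWriter, it only replaces a <w:t> whose entire text equals the target, and it does not handle DOCTYPE or entity reference nodes.

```csharp
using System.IO;
using System.Xml;

public static class XmlPipe
{
    // Namespace bound to the w: prefix in the sample document.
    private const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: copies input to output node by node and swaps the
    // text of any <w:t> element whose whole value equals `target`.
    public static void Clean(TextReader input, TextWriter output,
                             string target, string replacement)
    {
        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output))
        {
            bool insideT = false; // true while positioned inside a <w:t> element
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Capture this before copying the attributes.
                        bool isEmpty = reader.IsEmptyElement;
                        insideT = !isEmpty && reader.LocalName == "t"
                                  && reader.NamespaceURI == WordNs;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName,
                                                 reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(insideT && reader.Value == target
                            ? replacement
                            : reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        insideT = false;
                        writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    // XmlDeclaration is skipped: with default settings the
                    // writer emits its own declaration.
                    // DocumentType and EntityReference are not handled here.
                }
            }
        }
    }
}
```

It would be called along the lines of XmlPipe.Clean(new StreamReader(inputStream), new StreamWriter(outputStream), "F_ck", "****"), where inputStream and outputStream stand in for whatever streams are at hand. Only the current node is held in memory, but every node type needs its own case, which is the part I would like to avoid.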
So the question is:
How do I cleanly pipe the XML data read from the XmlTextReader to the XmlTextWriter while still being able to manipulate the data along the way?