
I'm reading in an XML stream that's approximately 100 MB, and I'd like to replace values that are over 1 MB.

example input

<root>
    <visit>yes</visit>
    <filedata>SDFSFDSDFfgdfgsgdf==(this is 5 mb)</filedata>
    <type>pdf</type>
    <moredata>sssssssssssssss (this 2mb)</moredata>
</root>

expected output

<root>
    <visit>yes</visit>
    <filedata>REPLACED TEXT</filedata>
    <type>pdf</type>
    <moredata>REPLACED TEXT</moredata>
</root>

Here's what I am using to read the stream, as well as checking the size:

XmlReader rdr = XmlReader.Create(new System.IO.StringReader(xml));
while (rdr.Read()) {
    if (rdr.Value.Length > ONEMEGABYTE) {
        // replace value with "REPLACED TEXT"
    }
}

How do I replace the value in rdr.Value?

Alex Gordon
  • You don't. `XmlReader` *reads*, as the name implies. You can write a wrapping reader that truncates the results for other consumers, use `XElement` (or `XmlDocument` if you really must) or inject some other logic in your processing step, but not in a loop that reads. – Jeroen Mostert May 22 '19 at 15:19
  • Just parse the XML (XElement or XmlDocument), find the node you want, and set `Value`. – 15ee8f99-57ff-4f92-890c-b56153 May 22 '19 at 15:20
  • I usually read one section of xml at a time and then parse into an XElement. See : https://stackoverflow.com/questions/40944048/reading-very-large-xml-bz2-files?rq=1 – jdweng May 22 '19 at 16:00
  • @jdweng thanks! but i'm not seeing how this actually alters the node value – Alex Gordon May 22 '19 at 16:05
  • Once you get the XElement(s) you can use the Set Value method to make changes. – jdweng May 22 '19 at 16:11
  • @EdPlunkett the only way for me to know what node i want is by size, from what i understand the easiest way to iterate over all xml nodes, keeping performance in mind is by using `XmlReader` – Alex Gordon May 22 '19 at 16:42
  • XmlDocument.Load() has an overload that takes an XmlReader. I wonder about creating a subclass that skips over undesired elements in Read()? – 15ee8f99-57ff-4f92-890c-b56153 May 22 '19 at 17:01
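
The "inject some other logic in your processing step" idea from the comments can be sketched by pairing the `XmlReader` with an `XmlWriter`: every node is copied across, and any text value over a length limit is swapped for a placeholder. This is a minimal sketch, not a definitive implementation. The `LargeValueReplacer` class, the `Copy` method and its parameters are hypothetical names, and only the most common node types are handled. Note that `reader.Value` still materializes each text node as a full string, so peak memory is bounded by the largest single value rather than by the whole document.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper; the names are illustrative and not from the original post.
public static class LargeValueReplacer
{
    // Copies XML from input to output, replacing any text or CDATA value
    // longer than maxLength characters with the given replacement string.
    public static void Copy(TextReader input, TextWriter output, int maxLength, string replacement)
    {
        var writerSettings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Remember emptiness before moving through the attributes.
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        // CDATA sections are written back as ordinary escaped text here.
                        string value = reader.Value;
                        writer.WriteString(value.Length > maxLength ? replacement : value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
    }
}
```

Because the input is only read and the result goes to a separate writer, the original payload is left untouched, which matters if both versions are needed. A call might look like `LargeValueReplacer.Copy(new StringReader(xml), resultWriter, ONEMEGABYTE, "REPLACED TEXT")`, where `resultWriter` is any `TextWriter`.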

3 Answers


You can subclass XmlReader to "filter" out undesired elements, then use XmlDocument.Load() with your reader instead of letting it create its own.

Note that this excludes only the value of the offending tags. If you put a breakpoint in your Read() loop, you'll find that <foo>bar</foo> comes in three pieces: <foo> has NodeType Element with no value, "bar" has NodeType Text with an empty LocalName, and </foo> has NodeType EndElement with no value. If "bar" were over the length limit, the "filter" below would turn <foo>bar</foo> into <foo></foo>. To exclude all of <foo>bar</foo> based on the length of "bar", you'd have to look ahead. Doable, but maybe not worth your time. Hopefully that's not a requirement here.

An alternative (or addition) to this class might be a version of this with a Func<string, string> that every Value is passed through: s => (s.Length > MAX_LEN) ? "" : s.

Also, for all I know, XmlTextReaderImpl (the actual type of _reader) may cache the whole text and kill your performance anyway. You may have to write your own guts for the thing as well.

using System;
using System.IO;
using System.Xml;

public class FilteredXmlReader : XmlReader
{
    // Returns true for nodes that should be kept, false for nodes to skip.
    public Func<XmlReader, bool> Filter;

    private XmlReader _reader;
    private FilteredXmlReader(TextReader input, Func<XmlReader, bool> filterProc)
    {
        Filter = filterProc;
        _reader = XmlReader.Create(input);
    }

    public static XmlReader Create(TextReader input, Func<XmlReader, bool> filterProc)
    {
        return new FilteredXmlReader(input, filterProc);
    }

    public override bool Read()
    {
        var b = _reader.Read();

        // Skip rejected nodes. Stop at end of input, and keep everything if no filter is set.
        while (b && Filter != null && !Filter(_reader))
        {
            b = _reader.Read();
        }

        return b;
    }

    #region Wrapper Boilerplate

    public override XmlNodeType NodeType => _reader.NodeType;

    public override string LocalName => _reader.LocalName;

    public override string NamespaceURI => _reader.NamespaceURI;

    public override string Prefix => _reader.Prefix;

    public override string Value => _reader.Value;

    public override int Depth => _reader.Depth;

    public override string BaseURI => _reader.BaseURI;

    public override bool IsEmptyElement => _reader.IsEmptyElement;

    public override int AttributeCount => _reader.AttributeCount;

    public override bool EOF => _reader.EOF;

    public override ReadState ReadState => _reader.ReadState;

    public override XmlNameTable NameTable => _reader.NameTable;

    public override string GetAttribute(string name) => _reader.GetAttribute(name);

    public override string GetAttribute(string name, string namespaceURI) => _reader.GetAttribute(name, namespaceURI);

    public override string GetAttribute(int i) => _reader.GetAttribute(i);

    public override string LookupNamespace(string prefix) => _reader.LookupNamespace(prefix);

    public override bool MoveToAttribute(string name) => _reader.MoveToAttribute(name);

    public override bool MoveToAttribute(string name, string ns) => _reader.MoveToAttribute(name, ns);

    public override bool MoveToElement() => _reader.MoveToElement();

    public override bool MoveToFirstAttribute() => _reader.MoveToFirstAttribute();

    public override bool MoveToNextAttribute() => _reader.MoveToNextAttribute();

    public override bool ReadAttributeValue() => _reader.ReadAttributeValue();

    public override void ResolveEntity() => _reader.ResolveEntity();

    #endregion Wrapper Boilerplate
}

Usage:

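var xml = "<test />";
XmlDocument doc = new XmlDocument();

XmlReader rdr = FilteredXmlReader.Create(new System.IO.StringReader(xml), 
                    r => r?.Value.Length < 20);

// Load the document through the filtering reader; without this call doc stays empty.
doc.Load(rdr);

var filteredXML = doc.OuterXml;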
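
The `Func<string, string>` alternative mentioned earlier in this answer could look roughly like the following. It is a sketch under assumptions: `TransformingXmlReader` is a hypothetical name, the wrapper forwards every member to an inner reader, and only `Value` is routed through the transform. Be aware that the transform also sees attribute values, since they are exposed through `Value` as well.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical variant: every Value is passed through a caller-supplied transform.
public class TransformingXmlReader : XmlReader
{
    private readonly XmlReader _reader;
    private readonly Func<string, string> _transform;

    public TransformingXmlReader(TextReader input, Func<string, string> transform)
    {
        _reader = XmlReader.Create(input);
        _transform = transform ?? (s => s); // no transform means pass values through unchanged
    }

    // The only member that differs from a plain pass-through wrapper.
    public override string Value => _transform(_reader.Value);

    // Everything else simply forwards to the wrapped reader.
    public override bool Read() => _reader.Read();
    public override XmlNodeType NodeType => _reader.NodeType;
    public override string LocalName => _reader.LocalName;
    public override string NamespaceURI => _reader.NamespaceURI;
    public override string Prefix => _reader.Prefix;
    public override int Depth => _reader.Depth;
    public override string BaseURI => _reader.BaseURI;
    public override bool IsEmptyElement => _reader.IsEmptyElement;
    public override int AttributeCount => _reader.AttributeCount;
    public override bool EOF => _reader.EOF;
    public override ReadState ReadState => _reader.ReadState;
    public override XmlNameTable NameTable => _reader.NameTable;
    public override string GetAttribute(string name) => _reader.GetAttribute(name);
    public override string GetAttribute(string name, string namespaceURI) => _reader.GetAttribute(name, namespaceURI);
    public override string GetAttribute(int i) => _reader.GetAttribute(i);
    public override string LookupNamespace(string prefix) => _reader.LookupNamespace(prefix);
    public override bool MoveToAttribute(string name) => _reader.MoveToAttribute(name);
    public override bool MoveToAttribute(string name, string ns) => _reader.MoveToAttribute(name, ns);
    public override bool MoveToElement() => _reader.MoveToElement();
    public override bool MoveToFirstAttribute() => _reader.MoveToFirstAttribute();
    public override bool MoveToNextAttribute() => _reader.MoveToNextAttribute();
    public override bool ReadAttributeValue() => _reader.ReadAttributeValue();
    public override void ResolveEntity() => _reader.ResolveEntity();
}
```

Used with `XmlDocument.Load()`, a transform such as `s => s.Length > MAX_LEN ? "REPLACED TEXT" : s` would keep the element in place and swap only its oversized text, instead of dropping the text node as the filter does.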

Here is an example of replacing a value using XmlReader and LINQ to XML:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;


namespace ConsoleApplication29
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
            XmlReader reader = XmlReader.Create(FILENAME);

            while (!reader.EOF)
            {
                if (reader.Name != "visits")
                {
                    reader.ReadToFollowing("visits");
                }
                if (!reader.EOF)
                {
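                    // ReadFrom materializes only the current <visits> element in memory.
                    XElement visits = (XElement)XElement.ReadFrom(reader);
                    XElement filedata = visits.Element("filedata");
                    // This changes the in-memory XElement only; nothing is written back to the file.
                    if (filedata != null)
                    {
                        filedata.SetValue("New Data");
                    }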

                }
            }

        }
    }
}

Here is the XML I used:

<root>
  <visits>
    <visit>yes</visit>
    <filedata>REPLACED TEXT</filedata>
    <type>pdf</type>
    <moredata>REPLACED TEXT</moredata>
  </visits>
</root>
jdweng
  • where are the changes persisted? are you actually making changes to the file? – Alex Gordon May 22 '19 at 16:22
  • i need to have both the original and the resulted payload, i'm reading from a stream, but not understanding what is actually being mutated when we executed `SetValue` – Alex Gordon May 22 '19 at 16:23
  • also, please keep in mind that i'd like to ONLY modify nodes that are greater than 1mb, i'm not understanding how your example would allow me to do that, since it's assuming that i know the node names, i.e. `filedata` – Alex Gordon May 22 '19 at 16:36

We can achieve this using XmlDocument: get all the child nodes of the root node, then loop through them and replace any value that is too long:

XmlDocument Doc = new XmlDocument();
Doc.Load(@"yourpath.xml"); // loads the whole document into memory
// Only the direct children of the root element are inspected.
XmlNodeList xmlNodelist = Doc.DocumentElement.ChildNodes;
foreach (XmlNode node in xmlNodelist)
{
    if (node.InnerText.Length > ONEMEGABYTE)
    {
        node.InnerText = "new value";
    }
}
Doc.Save(@"yourpath.xml"); // overwrites the source file with the modified document
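
The loop above looks only at direct children of the root. If oversized values can sit at any depth, one possible variation is to select every text node with XPath and test each one on its own. This is a sketch under assumptions: `XmlDocumentTextReplacer` and `ReplaceLargeText` are hypothetical names, and the whole document is still loaded into memory first.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;

// Hypothetical helper; the names are illustrative and not from the original answer.
public static class XmlDocumentTextReplacer
{
    // Replaces the value of every text or CDATA node longer than maxLength characters.
    public static void ReplaceLargeText(XmlDocument doc, int maxLength, string replacement)
    {
        // "//text()" matches text nodes at any depth. The result is copied to a list
        // first so the document is not modified while the XPath result is being enumerated.
        List<XmlNode> textNodes = doc.SelectNodes("//text()").Cast<XmlNode>().ToList();
        foreach (XmlNode textNode in textNodes)
        {
            if (textNode.Value.Length > maxLength)
            {
                textNode.Value = replacement;
            }
        }
    }
}
```

Because only text nodes are touched, elements that contain child elements keep their structure, which setting `InnerText` on a parent element would not guarantee.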
Arpit Gupta