I am looking for an overload with the signature mentioned above, i.e. RdfXmlParser.Load taking an IRdfHandler and an XmlDocument.

I need to load from an XmlDocument because loading from the OWL file directly or via a Stream results in an error:

"The input document has exceeded a limit set by MaxCharactersFromEntities."

Is there something obvious which I am not aware of?

Thanks, Jan

Edit 1 - Adding code showing exception

I am trying to parse the Cell Line Ontology (~100 MB). Because I only need some specific content, I would like to use a handler to focus on the interesting parts. To demonstrate my issue, I use the CountHandler:

    private static void loadCellLineOntology()
    {
        try
        {
            // Settings I would like the underlying XML reader to use (0 = no entity limit)
            var settings = new System.Xml.XmlReaderSettings()
            {
                MaxCharactersFromEntities = 0,
                DtdProcessing = System.Xml.DtdProcessing.Parse
            };

            var doc = new System.Xml.XmlDocument();
            var parser = new VDS.RDF.Parsing.RdfXmlParser(VDS.RDF.Parsing.RdfXmlParserMode.DOM);

            // Attempt: read the file into an XmlDocument using the settings above
            //using (var stream = new System.IO.FileStream(@"C:\Users\jan.hummel\Downloads\clo.owl", System.IO.FileMode.Open))
            //using (var reader = System.Xml.XmlReader.Create(stream, settings))
            using (IGraph g = new NonIndexedGraph())
            {
                //doc.Load(reader);
                //parser.Load(g, @"C:\Users\jahu\Downloads\clo.owl");

                // Loading from the file name throws the MaxCharactersFromEntities error
                var handler = new VDS.RDF.Parsing.Handlers.CountHandler();
                parser.Load(handler, @"C:\Users\jahu\Downloads\clo.owl");

                // The overload I am looking for (does not compile):
                //parser.Load(handler, doc);
            }
        }
        catch (Exception ex)
        {
            // Inspect ex here: "The input document has exceeded a limit set by MaxCharactersFromEntities."
            Debugger.Break();
        }
    }
Samu Lang
jahu

1 Answer

There's nothing obvious. The overload you're looking for doesn't exist, and the RDF/XML parser infrastructure doesn't allow you to set XmlReaderSettings.MaxCharactersFromEntities.
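
For context, here is a minimal sketch of what that setting does on a plain XmlReader, with no dotNetRDF involved. The class name EntityLimitDemo, the method TryLoad and the tiny sample document are made up for illustration; only the System.Xml calls are real API.

```csharp
using System;
using System.IO;
using System.Xml;

internal static class EntityLimitDemo
{
    // Hypothetical sample: a DTD entity that expands to 20 characters, referenced twice.
    private const string Sample =
        "<!DOCTYPE root [<!ENTITY e \"01234567890123456789\">]><root>&e;&e;</root>";

    // Loads the sample into an XmlDocument using the given entity expansion limit.
    // Returns null on success, or the XmlException message if loading fails.
    public static string TryLoad(long maxCharactersFromEntities)
    {
        var settings = new XmlReaderSettings
        {
            // The DTD must be processed, otherwise entities cannot be expanded at all
            DtdProcessing = DtdProcessing.Parse,

            // 0 means "no limit"; any positive value caps the expanded characters
            MaxCharactersFromEntities = maxCharactersFromEntities
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(Sample), settings))
            {
                var doc = new XmlDocument();
                doc.Load(reader);
            }

            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }
}
```

Calling EntityLimitDemo.TryLoad(10) should return the same kind of error message as in the question, because the expanded entities exceed 10 characters, while EntityLimitDemo.TryLoad(0) should return null. The catch is that you only control these settings when you create the reader yourself, which is what the parser does not let you do.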

I was able to work around this by replicating the relevant parts of the parser, just far enough down to change that setting. Beware that this relies on internal implementation details, hence all the private dispatching via reflection.

The interesting bit is at CellLineOntology.RdfXmlParser.Context.Generator.ctor(Stream).

With the code below in place, you can call:

var handler = new VDS.RDF.Parsing.Handlers.CountHandler();
CellLineOntology.RdfXmlParser.Load(handler, @"..\..\..\..\clo.owl");

I get a count of 1,387,097 statements using the file you linked.


namespace CellLineOntology
{
    using System;
    using System.IO;
    using System.Reflection;
    using System.Xml;
    using VDS.RDF;
    using VDS.RDF.Parsing.Contexts;
    using VDS.RDF.Parsing.Events;
    using VDS.RDF.Parsing.Events.RdfXml;
    using VDS.RDF.Parsing.Handlers;

    internal class RdfXmlParser
    {
        public static void Load(IRdfHandler handler, string filename)
        {
            using (var input = File.OpenRead(filename))
            {
                Parse(new Context(handler, input));
            }
        }

        private static void Parse(RdfXmlParserContext context) => typeof(VDS.RDF.Parsing.RdfXmlParser).GetMethod("Parse", BindingFlags.Instance | BindingFlags.NonPublic).Invoke(new VDS.RDF.Parsing.RdfXmlParser(), new[] { context });

        private class Context : RdfXmlParserContext
        {
            private IEventQueue<IRdfXmlEvent> _queue
            {
                set => typeof(RdfXmlParserContext).GetField("_queue", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
            }

            public Context(IRdfHandler handler, Stream input)
                : base(handler, Stream.Null)
            {
                _queue = new StreamingEventQueue<IRdfXmlEvent>(new Generator(input, ToSafeString(GetBaseUri(handler))));
            }

            private static Uri GetBaseUri(IRdfHandler handler) => (Uri)typeof(HandlerExtensions).GetMethod("GetBaseUri", BindingFlags.Static | BindingFlags.NonPublic).Invoke(null, new[] { handler });

            private static string ToSafeString(Uri uri) => (uri == null) ? string.Empty : uri.AbsoluteUri;

            private class Generator : StreamingEventGenerator
            {
                private XmlReader _reader
                {
                    set => typeof(StreamingEventGenerator).GetField("_reader", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
                }

                private bool _hasLineInfo
                {
                    set => typeof(StreamingEventGenerator).GetField("_hasLineInfo", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
                }

                private string _currentBaseUri
                {
                    set => typeof(StreamingEventGenerator).GetField("_currentBaseUri", BindingFlags.Instance | BindingFlags.NonPublic).SetValue(this, value);
                }

                public Generator(Stream stream)
                    : base(Stream.Null)
                {
                    var settings = GetSettings();

                    // This is why we're here: 0 removes the limit on characters produced by entity expansion
                    settings.MaxCharactersFromEntities = 0;

                    var reader = XmlReader.Create(stream, settings);

                    _reader = reader;
                    _hasLineInfo = reader is IXmlLineInfo;
                }

                public Generator(Stream stream, string baseUri)
                    : this(stream)
                {
                    _currentBaseUri = baseUri;
                }

                private XmlReaderSettings GetSettings() => (XmlReaderSettings)typeof(StreamingEventGenerator).GetMethod("GetSettings", BindingFlags.Instance | BindingFlags.NonPublic).Invoke(this, null);
            }
        }
    }
}
Samu Lang
  • Thank you @Samu Lang! Your solution works perfectly. However, I would love to understand why the overload is missing. – jahu Oct 01 '19 at 10:22
  • I imagine because it was never needed; I don't know of any other reason why it shouldn't be there. One could add it fairly easily, because the [RdfXmlParserContext already handles that case](https://github.com/dotnetrdf/dotnetrdf/blob/master/Libraries/dotNetRDF/Parsing/Contexts/RdfXmlParserContext.cs#L69-L75). Mind you, loading the XML into a DOM (XmlDocument) is not necessarily your best option if you have a really large file like this one. – Samu Lang Oct 01 '19 at 14:41