
I'm trying to work out how to use LINQ to XML on files that are structured but are not strictly XML. They are well formed but do not contain an XML declaration; they are in fact SGML files.

At the moment I have:

private void Find_element_attribute_Click(object sender, EventArgs e)
{
     if (comboBox2.Text != "")
     {
         string[] projectFiles = Directory.GetFiles(path, typeExtention, SearchOption.AllDirectories);

         foreach (string file in projectFiles)
         {  
             XElement root = XElement.Load(file); 
             IEnumerable<XElement> selectedElement = from el in root.Elements(Element_textBox.Text)
                  where (string)el.Attribute(Attribute_textBox.Text) == Value_textBox.Text
                  select el; // need to select the DMC and title, put them in a variable, and list them

             foreach (XElement el in selectedElement)
                 MessageBox.Show("text" + el);
         }

     }
     else
     {
         // Only prompt when no project has been selected
         MessageBox.Show("Please select a project to query");
     }
}

This throws an exception because of a '[' character on the second line. This character is the opening bracket for the list of entity declarations within the document.

The only way I can think of to make this work is to add an XML declaration to the beginning of each document as I open it, query the document using LINQ, and then remove the declaration again. However, I have no idea how to go about this. Any help appreciated.

The start of my document looks like this:

<!--Arbortext, Inc., 1988-2009, v.4002-->
<!DOCTYPE DMODULE PUBLIC "-//AECMA CSDB//DTD Air Vehicle Engines Equipment Description 19980102//EN" [
<!ENTITY ICN-BR8412XXXXXXX-1CX-AG30000-A-K7626-01966-A01-1 SYSTEM "ICN-BR8412XXXXXXX-1CX-AG30000-A-K7626-01966-A01-1.cgm" NDATA cgm>
<!ENTITY ICN-BR8412XXXXXXX-1CX-AG30000-A-K7626-01964-A01-1 SYSTEM "ICN-BR8412XXXXXXX-1CX-AG30000-A-K7626-01964-A01-1.cgm" NDATA cgm>
<!ENTITY ICN-BR8412XXXXXXX-1CX-AG30000-A-K7626-01963-A01-1 SYSTEM "ICN-BR8412XXXXXXX-
]>
<dmodule><idstatus>
<dmaddres>
<dmc><avee><modelic>XXXXXXAXXXXXX</modelic><sdc>1AX</sdc><chapnum>AG3</chapnum>
<section>0</section><subsect>0</subsect><subject>00</subject><discode>01</discode>
<discodev>00</discodev><incode>018</incode><incodev>A</incodev><itemloc>A
</itemloc></avee></dmc>
<dmtitle><techname>Equipment - INTRODUCTION</techname><infoname>Introduction
</infoname>
</dmtitle>
<issno issno="001" type="new">
<issdate year="2012" month="11" day="30"></dmaddres>
<status>
<security class="3">
<rpc> </rpc>
<orig> </orig>
<applic></applic>
<techstd>
<autandtp>
<authblk>Chap 1</authblk>
<tpbase>8412(A)</tpbase>
</autandtp>
<authex></authex>
<notes></notes>
</techstd>
<qa>
<firstver type="tabtop"></qa>
</status>
</idstatus><content>
<refs>
<norefs></refs>
<descript>
<para0><title>INTRODUCTION</title>
svick
Daedalus

2 Answers


The problem here isn't a missing XML declaration; it's the content of lines 2 to 6. Because those lines aren't valid XML, the parser cannot read them. A crude workaround is to skip those lines:

// Drop the first six lines (the comment and the DOCTYPE block), then parse the rest
string content = String.Join("\n", File.ReadAllLines(file).Skip(6));
XElement root = XElement.Parse(content);

If the remaining content were valid XML you shouldn't have any further problems, but when I tried it, the rest of the document turned out to be invalid as well.

See the W3C XML 1.0 Recommendation for the rules a valid XML file must follow.
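Since the number of entity declarations varies from file to file, a hard-coded Skip(6) is brittle. As a rough sketch only, the prolog could instead be dropped by looking for the line that closes the DOCTYPE internal subset. The class name PrologStripper and the method SkipProlog are made up for this illustration, and the code assumes "]>" ends a line of its own, as in the sample above. It does nothing about the unclosed tags later in the file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrologStripper
{
    // Hypothetical helper: drops every line up to and including the first
    // line that ends with "]>" (assumed to close the DOCTYPE internal subset).
    // If no such line exists, the input is returned unchanged.
    public static IEnumerable<string> SkipProlog(IEnumerable<string> lines)
    {
        List<string> list = lines.ToList();
        int end = list.FindIndex(
            l => l.TrimEnd().EndsWith("]>", StringComparison.Ordinal));
        return end < 0 ? list : list.Skip(end + 1);
    }
}
```

It would be used in place of Skip(6), for example String.Join("\n", PrologStripper.SkipProlog(File.ReadAllLines(file))).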

Omar

The XML parser is not complaining because you have a DOCTYPE declaration; it's complaining because the DOCTYPE declaration is incorrect. According to the XML specification, PUBLIC has to be followed by two strings (“PubidLiteral” and “SystemLiteral”), not just one.
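To check this, here is a minimal sketch that parses two documents differing only in whether a system identifier follows the public identifier. The class name DoctypeCheck, the tiny <a/> document, the public identifier and the example.dtd name are all invented for the illustration. The resolver is set to null so the reader should not try to fetch the external DTD.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class DoctypeCheck
{
    // SGML style: public identifier only (hypothetical identifier).
    public const string PublicOnly =
        "<!DOCTYPE a PUBLIC \"-//EXAMPLE//DTD Demo//EN\" [\n]>\n<a/>";

    // XML style: public identifier followed by a system identifier.
    public const string PublicAndSystem =
        "<!DOCTYPE a PUBLIC \"-//EXAMPLE//DTD Demo//EN\" \"example.dtd\" [\n]>\n<a/>";

    // Returns true when the text loads as XML. DTD processing is enabled so
    // the DOCTYPE is actually parsed; XmlResolver is null so the external
    // DTD named by the system identifier is not resolved.
    public static bool IsWellFormed(string text)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = null
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(text), settings))
            {
                XDocument.Load(reader);
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

IsWellFormed(DoctypeCheck.PublicOnly) should come back false, failing at the '[' just as in the question, while IsWellFormed(DoctypeCheck.PublicAndSystem) should come back true.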

But I think there is no point trying to fix the file, since it contains sections like:

<qa>
<firstver type="tabtop"></qa>

Having unclosed tags like this is okay in SGML (and HTML), but it is not allowed in XML. Because of that, I think you shouldn't try to use LINQ to XML to parse this file, since it really isn't XML.
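A small sketch of that difference, using the fragment above (the class name UnclosedTagCheck and the method Parses are hypothetical names for this illustration):

```csharp
using System.Xml;
using System.Xml.Linq;

public static class UnclosedTagCheck
{
    // Returns true when LINQ to XML accepts the fragment as a single element.
    public static bool Parses(string fragment)
    {
        try
        {
            XElement.Parse(fragment);
            return true;
        }
        catch (XmlException)
        {
            // Mismatched or unclosed tags end up here.
            return false;
        }
    }
}
```

Parses("<qa><firstver type=\"tabtop\"></qa>") returns false because firstver is never closed, whereas the XML spelling Parses("<qa><firstver type=\"tabtop\"/></qa>") returns true.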

But it would make sense to use LINQ to XML if you could use an implementation of XmlReader that could actually read SGML. And SGMLReader, mentioned in a comment by Alex Filipovici, seems to be exactly that.

svick