Get an element from a large HTML file

Question

I have a big HTML file (80 MB) like this:

<html>
   <head>...</head>
   <body>
      <div class="nothing">...</div>
      <div class="content">
         <h1>Hello</h1>
         <div>
            <div class="phone"> ... </div>
            <div class="phone"> ... </div>
            <div class="phone"> ... </div>
         </div>
         <div>
            <div class="phone">
               ...
               <div>
                  ...
               </div>
               ...
            </div>
            <div class="phone"> ... </div>
         </div>
         <div>
            <div class="phone"> ... </div>
            <div class="phone"> ... </div>
            <div class="phone"> ... </div>
            <div class="phone"> ... </div>
         </div>
      </div>
   </body>
</html>

I can't modify this HTML file manually, so ideally it stays read-only.

I would like to store each <div class="phone"> ... </div> element as a string in an array so that I can manipulate it later. Those divs can also contain other elements of any kind.

I tried to load this file with HtmlDocument and XmlDocument, but the file is so big that I get an OutOfMemory exception.
I also tried to collect all those elements into an array with Regex, but I couldn't get it to work.

The regular expression that I used is:

Regex.Matches(myHtml, "<div class=\"phone\">[\\p{L}\\s]*\\,*[\\p{L}\\s]*<div");

This regex matches every

<div class="phone"> ANY UTF8 char </div>

The problem is that it matches all UTF-8 characters until it finds the next </div>, and that closing tag is not necessarily the one that belongs to the div that was opened first.

Any ideas how I can do this? Could the file be cut into several strings so that each one can be loaded into an HtmlDocument?

Thanks.

Have you tried using the forward-only `XmlReader` or `XPathNavigator` methods? — Dai, Oct 07 '16 at 23:01
XPathNavigator gets my vote; it should have the lowest memory footprint because you're querying the file instead of loading the entire thing into memory. — brandonscript, Oct 07 '16 at 23:02
I get a problem with XPathNavigator, something like "hexadecimal value 0x0C, is an invalid character". Solving it requires Regex.Replace, but then I get an out-of-memory exception (http://stackoverflow.com/questions/21053138/c-sharp-hexadecimal-value-0x12-is-an-invalid-character). So I used XmlReader and it's working :) Thanks! — Volkan, Oct 07 '16 at 23:37

NineBerry · Accepted Answer · 2016-10-08T20:01:28.153

2

You can use the XmlReader class to read the file. XmlReader does not load the whole file into memory; it lets you move through the XML document node by node, parsing it on the fly.

Example of how to read the content of all divs with class = phone:

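using (XmlReader reader = XmlReader.Create(@"C:\A.html"))
{
     // Move to the first node; the loop below advances the reader itself
     reader.Read();
     while (!reader.EOF)
     {
          // Check whether we are on the start tag of a div with attribute class = phone
          if (reader.NodeType == XmlNodeType.Element && reader.Name == "div" && reader.GetAttribute("class") == "phone")
          {
               // Yes, so read until the corresponding closing tag and output the content.
               // ReadInnerXml also moves the reader past that closing tag,
               // so Read() is not called again in this branch.
               textBox1.AppendText(reader.ReadInnerXml() + Environment.NewLine);
          }
          else
          {
               // No, so advance to the next node
               reader.Read();
          }
     }
}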

For more details refer to the documentation.
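
Since the question asks for the contents in a table of strings rather than in a text box, here is a minimal, self-contained sketch of the same streaming approach that collects the results into a list. The names PhoneDivExtractor and ExtractPhoneDivs are hypothetical, and the sketch assumes the file is well-formed enough for XmlReader to parse. As far as I understand the documented behaviour, ReadInnerXml leaves the reader positioned on the node after the closing tag, so the loop only calls Read() when it has not just called ReadInnerXml. That way two phone divs with nothing between them should both be picked up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PhoneDivExtractor
{
    // Hypothetical helper: returns the inner markup of every <div class="phone">.
    // Assumes the input is well-formed XML (XHTML-like); plain HTML may not parse.
    public static List<string> ExtractPhoneDivs(TextReader input)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "div"
                    && reader.GetAttribute("class") == "phone")
                {
                    // ReadInnerXml consumes the element, nested tags included,
                    // and leaves the reader on the node after the closing tag,
                    // so the loop continues without another Read().
                    results.Add(reader.ReadInnerXml());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return results;
    }

    public static void Main()
    {
        // Two adjacent phone divs with no whitespace between them,
        // the first one containing a nested div.
        string html = "<html><body><div class=\"content\">"
            + "<div class=\"phone\">A<div>inner</div></div>"
            + "<div class=\"phone\">B</div>"
            + "</div></body></html>";

        List<string> phones = ExtractPhoneDivs(new StringReader(html));
        foreach (string phone in phones)
        {
            Console.WriteLine(phone);
        }
    }
}
```

For the real file, the same helper could be fed a StreamReader opened on C:\A.html, and calling ToArray() on the returned list gives a string[] if an array is preferred. Only one element at a time is buffered while reading, although the collected strings themselves still have to fit in memory.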

edited Oct 08 '16 at 20:01

answered Oct 08 '16 at 10:44

NineBerry


The problem with this code is that ReadInnerXml() does the same kind of thing as Read(): both move to the next tag. For example, when ReadInnerXml is called, the pointer moves on to the next tag, and then when we get back to the while condition, it jumps to the next tag again... so it misses one tag...
– Volkan Oct 08 '16 at 19:18
I edited your code if you don't mind. Thank you for your help. – Volkan Oct 08 '16 at 19:28
Still needed improvement – NineBerry Oct 08 '16 at 20:02
Actually, ReadInnerXml behaves like a call to Read: it reads until the closing tag and moves the reader's pointer to the next tag. After the first iteration, reader.Read() moves the pointer to the next tag again, so a tag is skipped in between.
...

this div is skipped

...
– Volkan Oct 08 '16 at 21:40
That's not true. ReadInnerXml reads to the end of the current tag; that part is true. But only the call to Read then makes the next tag the current one. So no tag is lost with the current code version. – NineBerry Oct 08 '16 at 21:47

score -1 · Answer 2 · answered Oct 08 '16 at 10:24

You can loop all elements with the class phone with jQuery and store them in a HiddenField. Then on PostBack you can access those values and process them.

<asp:HiddenField ID="HiddenField1" runat="server" />

<script type="text/javascript">
    function getValues() {
        var valueArray = new Array();
        var valueString = "";
        $(".phone").each(function (index, element) {
            //for demo store both in hiddenfield and javascript array
            valueArray.push(element.innerHTML);
            valueString += element.innerHTML + ",";
        });
        $("#<%=HiddenField1.ClientID %>").val(valueString);
    }
</script>

And in code-behind:

    protected void Button1_Click(object sender, EventArgs e)
    {
        string valueString = HiddenField1.Value;
        if (!string.IsNullOrEmpty(valueString))
        {
            string [] valueArray = valueString.TrimEnd(',').Split(',');
            foreach (string s in valueArray)
            {
                //do stuff
            }
        }
    }
