I have a large HTML file (80 MB) that looks like this:
<html>
<head>...</head>
<body>
<div class="nothing">...</div>
<div class="content">
<h1>Hello</h1>
<div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
</div>
<div>
<div class="phone">
...
<div>
...
</div>
...
</div>
<div class="phone"> ... </div>
</div>
<div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
</div>
</div>
</body>
</html>
I can't modify this HTML file manually, so ideally it should stay read-only.
I would like to store each <div class="phone"> ... </div> block
as a string in an array so that I can manipulate it later. Each of those divs can also contain other elements of any kind.
- I tried to load the file with HtmlDocument and XmlDocument, but it is so big that I get an out-of-memory exception.
- I tried to use a Regex to collect all those elements in an array, but I couldn't get it to work.
The regular expression that I used is:
Regex.Matches(myHtml, "<div class=\"phone\">[\\p{L}\\s]*\\,*[\\p{L}\\s]*<div");
This regex matches every
<div class="phone"> ANY UTF8 char </div>
but the problem is that it consumes characters until it reaches the next </div>,
and that closing tag is not necessarily the one that matches the opening <div class="phone">.
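To make the problem concrete, here is a minimal sketch. It uses a simplified lazy pattern rather than my exact regex above, and the names NestedDivDemo and FirstLazyMatch are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Returns the first lazy match of <div class="phone"> ... </div>,
    // or an empty string when nothing matches.
    public static string FirstLazyMatch(string html)
    {
        // RegexOptions.Singleline lets '.' match newlines as well.
        Match m = Regex.Match(
            html,
            "<div class=\"phone\">.*?</div>",
            RegexOptions.Singleline);
        return m.Success ? m.Value : string.Empty;
    }
}
```

For the input <div class="phone">a<div>b</div>c</div> the match ends at the inner </div>, so the trailing c</div> of the phone block is lost.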
Any ideas on how I can do this? Could the file be split into several strings, so that each piece can be loaded into an HtmlDocument?
Thanks.
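One possible direction, sketched under assumptions rather than tested on the real file: scan the text and count nested <div> tags so that each phone block ends at its own closing tag. This assumes the whole file fits in a single string (as my Regex attempt already requires), that the opening tag is spelled exactly <div class="phone">, and that no comment, script, or other tag name starts with "<div". PhoneDivExtractor and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class PhoneDivExtractor
{
    // Collects every <div class="phone"> ... </div> block, tracking the
    // nesting depth so that inner <div> elements do not end a block early.
    public static List<string> Extract(string html)
    {
        const string openTag = "<div class=\"phone\">";
        var result = new List<string>();
        int pos = 0;
        while (true)
        {
            int start = html.IndexOf(openTag, pos, StringComparison.Ordinal);
            if (start < 0) break;

            int depth = 1;                      // we are inside the phone div
            int i = start + openTag.Length;
            while (depth > 0)
            {
                int nextOpen = html.IndexOf("<div", i, StringComparison.Ordinal);
                int nextClose = html.IndexOf("</div>", i, StringComparison.Ordinal);
                if (nextClose < 0) return result;   // unbalanced markup: stop

                if (nextOpen >= 0 && nextOpen < nextClose)
                {
                    depth++;                    // a nested <div> opens first
                    i = nextOpen + "<div".Length;
                }
                else
                {
                    depth--;                    // a </div> closes one level
                    i = nextClose + "</div>".Length;
                }
            }
            result.Add(html.Substring(start, i - start));
            pos = i;                            // continue after this block
        }
        return result;
    }
}
```

Each returned string is small and balanced, so it could presumably be loaded into an HtmlDocument on its own.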