Parse C# HTML String WITHOUT html parsers like AgilityPack

Question

I have an HTML Table as below:

<table border='1' width='100%'>
<tr>
<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Title2</p>
    </th>
</tr>
<tr>
    <th>
        <div>Content2</div>
    </th>
</tr>
</table>
</td>

<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Hello Title1</p>
    </th>
</tr>
<tr>
    <th>
        <div>Hello content 1</div>
    </th>
</tr>
</table>
</td>
</tr>
</table>

I am building a Windows application that reads all the titles and shows them in a list. When the user selects a title from the list, the application needs to show the content of the corresponding table.

Q: How can I read all the titles and display them without using HTMLAgilityPack or any other parsers?

So far I have done this:

        WebClient wc = new WebClient();
        System.IO.Stream stream = wc.OpenRead(strFilePath);
        StreamReader sReader = new StreamReader(stream);
        string strTables = sReader.ReadToEnd();
        if (!string.IsNullOrEmpty(strTables))
        { 
            //code to parse html tables
        }

As you may have noticed, each title is inside a <p> element and each piece of content is inside a <div> element. Any ideas?

Why don't you want to use a parser? Parsing HTML might seem easy but it's definitely not. – Mick, Aug 16 '14 at 06:30
I've done it using HtmlAgilityPack, but my employers don't want me to use any ready-made parsers. – firefalcon, Aug 16 '14 at 09:14
Your employers should be doing the programming if they're making those kinds of decisions. Back seat programming from people who are technically illiterate is a really bad idea. They should rely on you to choose the most effective implementation. – Mick, Aug 16 '14 at 09:53

score 1 · Answer 1 · answered Aug 16 '14 at 06:20

HTML is of course also XML, so why not use XmlReader?

After that, use all the XmlDocument methods and LINQ you can to find what you are looking for. It will give you more flexible, maintainable, and efficient code than anything you'd have to write by hand.

That is, of course, if you mean 'no external HTML parsers'.
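
A minimal sketch of this idea, under a strong assumption: the input must be well-formed XML, which the sample in the question happens to be but HTML in general is not. It uses LINQ to XML (`XDocument`) rather than `XmlReader` or `XmlDocument`, and the names `XmlTableReader` and `ReadSections` are made up for illustration. It treats every `<table>` that contains no nested `<table>` as one section, taking its first `<p>` as the title and its first `<div>` as the content.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlTableReader
{
    // Returns one (title, content) pair per innermost <table>.
    // Assumption: the input is well-formed XML; otherwise XDocument.Parse throws XmlException.
    // Note: XML element names are case-sensitive, so <TABLE> or <P> would not be found.
    public static List<KeyValuePair<string, string>> ReadSections(string html)
    {
        XDocument doc = XDocument.Parse(html);

        return doc.Descendants("table")
            // Keep only tables that do not contain another table.
            .Where(t => !t.Descendants("table").Any())
            .Select(t => new KeyValuePair<string, string>(
                // Casting a null XElement to string yields null, hence the fallback.
                ((string)t.Descendants("p").FirstOrDefault() ?? string.Empty).Trim(),
                ((string)t.Descendants("div").FirstOrDefault() ?? string.Empty).Trim()))
            .ToList();
    }
}
```

Calling `XmlTableReader.ReadSections(strTables)` on the table from the question should give two pairs, whose keys can fill the list and whose values can be shown when a title is selected. Unclosed tags, unquoted attributes, or entities such as `&nbsp;` would make `XDocument.Parse` throw.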

Pieter21

Html is not XML...you simply can not do that. – Alexei Levenkov Aug 16 '14 at 06:47
@AlexeiLevenkov I'm thinking of parsing a string that contains HTML tables. Are there any ideas on how to read the string, for example from `<table>` to `</table>`? – firefalcon Aug 16 '14 at 09:16

Amir Popovich · Accepted Answer · 2014-08-16T06:24:40.623

0

Even though parsing HTML with regex is not best practice, it is an option:

Patterns:

<p>.*</p>
<div>.*</div>

Example:

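    // Requires: using System.IO; using System.Net; using System.Text.RegularExpressions;
    WebClient wc = new WebClient();
    System.IO.Stream stream = wc.OpenRead(strFilePath);
    StreamReader sReader = new StreamReader(stream);
    string strTables = sReader.ReadToEnd();
    if (!string.IsNullOrEmpty(strTables))
    {
        // I'm not a regex master but I'm sure there is a way to get the title without the <p> elements.
        // Note: '.' does not match newlines by default, so each match stays on one line;
        // two <p> elements on the same line would be merged by the greedy '.*'.
        MatchCollection pMatches = Regex.Matches(strTables, "<p>.*</p>");
        foreach (Match pMatch in pMatches)
        {
            // Strip the surrounding tags to leave just the title text.
            string title = pMatch.Value.Replace("<p>", string.Empty).Replace("</p>", string.Empty);
        }
    }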
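
One possible way to pair each title with its content, and to ignore `<p>` elements that sit outside any table, is to first isolate the innermost tables and then search inside each one. This is only a sketch: the names `RegexTableReader` and `ReadSections` are hypothetical, and it assumes tags are written as `<table ...>`, `<p ...>` and `<div ...>` with no nested `<p>` or `<div>` inside a title or content block. Capture groups return the text directly, so no `Replace` calls are needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTableReader
{
    // Singleline lets '.' match newlines; IgnoreCase accepts <TABLE>, <P>, and so on.
    private const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // A <table> whose body contains no further "<table", i.e. an innermost table.
    private static readonly Regex InnerTable =
        new Regex(@"<table\b[^>]*>((?:(?!<table\b).)*?)</table>", Options);

    // Lazy '.*?' stops at the first closing tag instead of the last one.
    private static readonly Regex Title = new Regex(@"<p\b[^>]*>(.*?)</p>", Options);
    private static readonly Regex Content = new Regex(@"<div\b[^>]*>(.*?)</div>", Options);

    // Returns one (title, content) pair per innermost table that has a <p> title.
    public static List<KeyValuePair<string, string>> ReadSections(string html)
    {
        var sections = new List<KeyValuePair<string, string>>();
        foreach (Match table in InnerTable.Matches(html))
        {
            string body = table.Groups[1].Value;

            Match title = Title.Match(body);
            if (!title.Success)
            {
                continue; // a table without a <p> is not treated as a section
            }

            Match content = Content.Match(body);
            sections.Add(new KeyValuePair<string, string>(
                title.Groups[1].Value.Trim(),
                content.Success ? content.Groups[1].Value.Trim() : string.Empty));
        }
        return sections;
    }
}
```

With the table from the question, `RegexTableReader.ReadSections(strTables)` should yield the pairs ("Title2", "Content2") and ("Hello Title1", "Hello content 1"). It remains fragile in the usual ways: commented-out markup or a `>` inside an attribute value would break it.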

edited Aug 16 '14 at 06:24

answered Aug 16 '14 at 06:18

You can use `var matches = Regex.Matches(html, "(?<=<(?:p|div)>).*?(?=</(?:p|div)>)");` to grab all of the `<p>` and `<div>` titles in one shot so you don't need that `foreach` after. – NathanW Aug 16 '14 at 06:35
Understatement of the year to say it is not best practice. You will have lots of problems with greedy matching, string manipulations etc. – Pieter21 Aug 16 '14 at 06:46
@Amir Popovich How do I skip `<p>` elements that are outside the table and are NOT titles? – firefalcon Aug 16 '14 at 09:45