I have an HTML table like the one below:

<table border='1' width='100%'>
<tr>
<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Title2</p>
    </th>
</tr>
<tr>
    <th>
        <div>Content2</div>
    </th>
</tr>
</table>
</td>

<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Hello Title1</p>
    </th>
</tr>
<tr>
    <th>
        <div>Hello content 1</div>
    </th>
</tr>
</table>
</td>
</tr>
</table>

I am making a Windows application that reads all the titles and shows them in a list. When the user selects a title from the list, the application needs to show the content of the selected table.

Q: How can I read all the titles and display them without using HTMLAgilityPack or any other parsers?

So far I have done this:

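        // Requires: using System.IO; using System.Net;
        string strTables;
        using (WebClient wc = new WebClient())
        using (Stream stream = wc.OpenRead(strFilePath))
        using (StreamReader sReader = new StreamReader(stream))
        {
            // Read the whole file into memory; the readers are disposed on exit.
            strTables = sReader.ReadToEnd();
        }
        if (!string.IsNullOrEmpty(strTables))
        {
            //code to parse html tables
        }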

As you may have noticed, each title is inside a <p> element and each content block is inside a <div> element. Any ideas?

VMai
firefalcon
  • Why don't you want to use a parser? Parsing HTML might seem easy but it's definitely not. – Mick Aug 16 '14 at 06:30
  • I've done it using HtmlAgilityPack, but my employers don't want me to use any ready-made parsers. – firefalcon Aug 16 '14 at 09:14
  • Your employers should be doing the programming if they're making those kinds of decisions. Back-seat programming from people who are technically illiterate is a really bad idea. They should rely on you to choose the most effective implementation. – Mick Aug 16 '14 at 09:53

2 Answers

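The markup you posted is well-formed XML (every tag is closed and every attribute is quoted), so why not use XmlReader? HTML in general is not guaranteed to be XML, so this only works while the input stays well-formed.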

After that, use the XmlDocument methods and LINQ to find what you are looking for. That will give you more flexible, maintainable, and efficient code than anything you would write by hand.

This assumes, of course, that you mean 'no external HTML parsers'.
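
A minimal sketch of the XML route, using LINQ to XML (`XDocument`) instead of `XmlReader` directly. The names `TableReader` and `ReadTables` are made up for illustration. The sketch assumes the input is well-formed XML like the sample in the question, has a single root element, and uses lowercase tag names. Unclosed tags or entities such as `&nbsp;` would make `XDocument.Parse` throw an `XmlException`.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableReader
{
    // Returns one (title, content) pair per innermost table.
    // Hypothetical helper; assumes well-formed markup with lowercase tag names.
    public static List<KeyValuePair<string, string>> ReadTables(string html)
    {
        // Throws XmlException if the markup is not well-formed XML.
        XDocument doc = XDocument.Parse(html);

        return doc.Descendants("table")
            // Keep only tables that contain no nested table, i.e. the inner ones.
            .Where(t => !t.Descendants("table").Any())
            // The first <p> is taken as the title, the first <div> as the content.
            .Select(t => new KeyValuePair<string, string>(
                (string)t.Descendants("p").FirstOrDefault() ?? string.Empty,
                (string)t.Descendants("div").FirstOrDefault() ?? string.Empty))
            .ToList();
    }
}
```

Calling `TableReader.ReadTables(strTables)` on the sample from the question should yield two pairs; the keys can fill the list control and the selected key's value can be shown as the content. Because only `<p>` elements inside an innermost table are read, paragraphs elsewhere in the file are ignored.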

Pieter21

Even though it's not best practice to parse HTML with regex, it is an option:

Patterns:

<p>.*</p>
<div>.*</div>

Example:

    // Requires: using System.IO; using System.Net; using System.Text.RegularExpressions;
    WebClient wc = new WebClient();
    System.IO.Stream stream = wc.OpenRead(strFilePath);
    StreamReader sReader = new StreamReader(stream);
    string strTables = sReader.ReadToEnd();
    if (!string.IsNullOrEmpty(strTables))
    {
        // I'm not a regex master but I'm sure there is a way to get the title without the <p> elements.
        MatchCollection pMatches = Regex.Matches(strTables, "<p>.*</p>");
        foreach (Match pMatch in pMatches)
        {
            // Strip the surrounding tags to leave only the title text.
            string title = pMatch.Value.Replace("<p>", string.Empty).Replace("</p>", string.Empty);
        }
    }
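
The patterns above are greedy and match `<p>` elements anywhere in the file. A possible refinement, sketched here under the assumption that every title/content pair lives in an innermost `<table>` and that the `<p>` and `<div>` tags carry no attributes, is to match each inner table first and then capture the text lazily with named groups. `RegexTableReader` and `Read` are hypothetical names, and this is still fragile compared with a real parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTableReader
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // An innermost table: an opening <table ...> tag, then a body that never
    // starts another "<table", then the closing </table> tag.
    private static readonly Regex InnerTable = new Regex(
        @"<table\b[^>]*>(?<body>(?:(?!<table\b).)*?)</table>", Options);

    // Lazy quantifiers (.*?) stop at the first closing tag instead of the last.
    private static readonly Regex Title = new Regex(@"<p>(?<text>.*?)</p>", Options);
    private static readonly Regex Content = new Regex(@"<div>(?<text>.*?)</div>", Options);

    // Returns one (title, content) pair per innermost table.
    public static List<KeyValuePair<string, string>> Read(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match table in InnerTable.Matches(html))
        {
            string body = table.Groups["body"].Value;
            // A failed match yields an empty group value, so no null checks are needed.
            string title = Title.Match(body).Groups["text"].Value.Trim();
            string content = Content.Match(body).Groups["text"].Value.Trim();
            result.Add(new KeyValuePair<string, string>(title, content));
        }
        return result;
    }
}
```

Since only text inside an innermost table is inspected, a `<p>` that sits outside the tables is skipped.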
Amir Popovich
  • You can use `var matches = Regex.Matches(html, "(?<=<(?:p|div)>).*?(?=</(?:p|div)>)");` to grab all of the `<p>` and `<div>` titles in one shot so you don't need that `foreach` after. – NathanW Aug 16 '14 at 06:35
  • Understatement of the year to say it is not best practice. You will have lots of problems with greedy matching, string manipulations etc. – Pieter21 Aug 16 '14 at 06:46
  • @Amir Popovich How can I skip `<p>` elements that are outside the table and are NOT titles? – firefalcon Aug 16 '14 at 09:45