
I am trying to come up with a neat way to automatically generate JSON schema markup (schema.org structured data) on my ASPX pages. The markup in question is FAQPage, but that's irrelevant.

I decided that I needed to scrape the content of the current page to find questions and answers. After a few false starts I came across the HtmlAgilityPack library, which lets me achieve what I want, but I've run into some issues.

The HtmlAgilityPack parser can be fed in a number of ways, but the only one I could get to work for my scenario (scraping the current page) was to pass in a string.

First, I gave the container element an ID and a runat="server" attribute so that I can reference it from the code-behind.

To get the string, I used HtmlTextWriter; here's the code:

    static string ConvertControlToString(Control ctl)
    {
        string s = null;

        var sw = new StringWriter();
        using (var w = new HtmlTextWriter(sw))
        {
            ctl.RenderControl(w);
            s = sw.ToString();
        }
        return s;
    }

Now, all that works fine - in most cases.

However, I'm running into edge cases on pages that use a ScriptManager and UpdatePanels, and I suspect there will be more. The error is: ... must be inside a form control with a runat="server". Of course it is, but RenderControl doesn't realise that.

So, two questions:

  1. Is there a way to feed the HtmlAgilityPack parser that doesn't require a string (and that won't loop)?
  2. Is there a better way to scrape the text than Control.RenderControl(), one that won't cause errors?

Incidentally, I've found a solution to the problem I'm having but it involves manipulating each affected page, and that's not great.

So, thought I'd throw it out there and see if there are better workarounds or a better solution.

John Ohara

1 Answer


You can load HTML in a few different ways, but ultimately HTML is a string, so that is what the parser will operate on. I'm not sure what you mean about looping.
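As a rough illustration of those loading options, here is a minimal sketch that parses the same markup once from a string and once from a TextReader. The sample markup, the `FaqParsingSketch` class, and the XPath expression are assumptions made up for this example, since the real page structure isn't shown in the question.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class FaqParsingSketch
{
    // Hypothetical markup: stands in for the string produced by ConvertControlToString.
    const string SampleHtml =
        "<div id='faq'><h3>What is X?</h3><p>X is a thing.</p></div>";

    // Parse directly from a string.
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Parse from any TextReader (for example a StringReader or StreamReader).
    public static HtmlDocument FromReader(TextReader reader)
    {
        var doc = new HtmlDocument();
        doc.Load(reader);
        return doc;
    }

    // SelectSingleNode returns null when nothing matches, hence the null check.
    public static string FirstQuestion(HtmlDocument doc)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='faq']/h3");
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    static void Main()
    {
        Console.WriteLine(FirstQuestion(FromString(SampleHtml)));
        using (var reader = new StringReader(SampleHtml))
        {
            Console.WriteLine(FirstQuestion(FromReader(reader)));
        }
    }
}
```

Whichever entry point is used, the parser ends up working on the same text, so the choice mostly comes down to where the HTML already lives.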

Rather than rendering controls as HTML and then parsing them, it might be better to let the entire page load and parse it after it has rendered. This allows your JavaScript/UpdatePanels to finish transforming the page before you parse the HTML.

The LoadFromBrowser method (I believe) loads the specified url in a headless browser, allows any javascript to run and then parses the resulting HTML: https://html-agility-pack.net/from-browser

If you need to attach authentication credentials there is a question addressing that here: HtmlAgilityPack and Authentication

Alternatively (keeping your existing code), you might try instantiating a new HtmlGenericControl with the tag "form", adding the control passed in to ConvertControlToString to it, and then rendering and parsing that, which may avoid your error. You may need to check that the control doesn't already contain a form tag. This approach doesn't address JavaScript/UpdatePanels, and I'm not 100% sure it would work.

    // Wrap the control in a <form> element before rendering it.
    HtmlGenericControl form = new HtmlGenericControl("form");
    Control ctl = new Control(); // placeholder: use the control passed to ConvertControlToString
    form.Controls.Add(ctl);

    string s = string.Empty;
    var sw = new System.IO.StringWriter();
    using (var w = new HtmlTextWriter(sw))
    {
        form.RenderControl(w);
        s = sw.ToString();
    }

JustAnotherDev
  • Thanks for your thorough response. I couldn't get the load from browser method to work - it constantly hung (presumably looping). – John Ohara Apr 30 '21 at 10:26
  • Any luck with the alternative method? I've updated the answer with some example code. I can't replicate the error (likely because my example control has no content) but the code hopefully gets across the approach I'm suggesting. – JustAnotherDev Apr 30 '21 at 10:47
  • I get you now. That's something I would never have thought of, so I'll give it a go. Thanks again. – John Ohara Apr 30 '21 at 17:23
  • No, the fake form doesn't work - still get the same "needs to be in a form with runat=server" message. – John Ohara May 01 '21 at 10:39
  • What is the underlying type of the control throwing the error on render? – JustAnotherDev May 02 '21 at 09:24
  • I'm pretty sure it's a combo ScriptManager/UpdatePanel. The error message identified the ScriptManager, so I took it out of the equation (outside the scraped control) but the error persisted. I removed both controls by calling the process on a control outside of the UpdatePanel and it worked fine. There is a fix but it involves adding additional code to each affected page. It's not ideal because it's not 'automatic'. At the moment I'm scraping all the page content, so a solution could be to be more targeted with the scraping (ignoring update panels etc.) – John Ohara May 03 '21 at 10:14