13

Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)? Is there a BeautifulSoup alternative or a port that works nicely with ASP.NET/C#?

I plan to use the library to extract readable text from arbitrary URLs.

Thanks

user300981

3 Answers

18

Html Agility Pack is a similar project, but for C# and .NET.


EDIT:

To extract all readable text:

doc.DocumentNode.InnerText

Note that this also returns the text content of <script> and <style> elements, which a browser does not display.

To fix that, remove all <script> and <style> elements before reading InnerText, like this:

// ToArray() snapshots the matches so that nodes can be removed while looping.
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();

(Credit: SLaks)
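
Putting the steps above together, a minimal sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name ReadableText and the method Extract are hypothetical names chosen for illustration and are not part of the library. Removing comment nodes, decoding entities and collapsing whitespace are extra steps beyond the snippet above, added on the assumption that you want plain, compact text.

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class ReadableText
{
    // Hypothetical helper (not part of Html Agility Pack):
    // returns the text of an HTML string without scripts, styles or comments.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the matches with ToArray() before removing anything,
        // because Descendants() is evaluated lazily over the live tree.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment
                     || n.Name == "script"
                     || n.Name == "style")
            .ToArray();

        foreach (var node in unwanted)
            node.Remove();

        // InnerText keeps entities such as &amp; encoded, so decode them,
        // then collapse runs of whitespace into single spaces.
        string text = WebUtility.HtmlDecode(doc.DocumentNode.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

To run this against a live page, one option is to fetch the markup first, for example with new HtmlWeb().Load(url), and apply the same node removal to the returned document. As the comments below point out, this only reflects the static markup, not what CSS or JavaScript produce in the final render.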

Martin Liversage
Colin Pickard
  • How would I use HAP to scrape readable text from an HTML page? In BeautifulSoup, it's very easy to do this. – user300981 Jul 28 '10 at 21:44
  • Does DocumentNode.InnerText get all the text within the tags? My worry is that I need to support this for URLs that do not follow any standard. There might be gunk all over. Is HAP smart enough to distinguish between readable text and irrelevant HTML tags, comments and client scripts? – user300981 Jul 30 '10 at 13:43
  • HAP is pretty smart at detecting what text will be output by a browser, but of course many sites these days make a lot of changes to the visible text in the final render with CSS, JavaScript and images. So the only reliable way to determine what a person could read when the page is rendered by a browser is to render it in a browser... – Colin Pickard Jul 30 '10 at 13:55
3

I know this is quite old, but I decided to post this for future reference. I came across this question while searching for a similar solution.

I found a library built on top of Html Agility Pack called ScrapySharp.

I've used it in much the same way as I would use BeautifulSoup: https://bitbucket.org/rflechner/scrapysharp/wiki/Home (EDIT: broken link, project moved to https://github.com/rflechner/ScrapySharp)

EDIT: https://www.nuget.org/packages/ScrapySharp/ has the package

Oligoglot
Yavor Shahpasov
2

You could try NSoup, although it currently has a few bugs:

http://nsoup.codeplex.com/

Adam