
I want to get a website's inner text through code.

I can already get its inner HTML through code, but I can't find any code that gets a URL's inner text without a WebBrowser control.

The code below gets the text of a website loaded in a WebBrowser control, but I need the same thing without the WebBrowser.

Dim sourceString As String = WebBrowser1.Document.Body.InnerText
Stefan Đorđević
  • Do you mean it's opening a web browser and you don't want to use the web browser (Internet Explorer)? – marshal craft Jan 28 '17 at 14:57
  • No, I want to get the website's text (not HTML) through code, without using a WebBrowser – Stefan Đorđević Jan 28 '17 at 14:58
  • Maybe you could use a web browser that also exposes its components through an API? If not, you'll have to use some socket implementation and most likely something like OpenSSL to handle the HTTPS. Then it's just a GET request, which will return all the HTML. – marshal craft Jan 28 '17 at 15:00
  • See http://stackoverflow.com/questions/92522/http-get-in-vb-net second answer, voted 20 up. – marshal craft Jan 28 '17 at 15:02
  • Do you want to collect the readable text rendered in a browser for something like SEO or word count analysis? – Mike Bateman Jan 28 '17 at 17:42
  • [extracting just page text using HTMLAgilityPack](http://stackoverflow.com/q/19343231/1115360) might be useful to you. – Andrew Morton Jan 28 '17 at 18:07
  • The term you are looking for is "screen scraping". That should lead you to the tools you need. – jmoreno Jan 28 '17 at 20:14

2 Answers

2

With HtmlAgilityPack...

Private Sub ToolStripButton1_Click(sender As Object, e As EventArgs) Handles ToolStripButton1.Click
    Dim doc As New HtmlAgilityPack.HtmlDocument
    ' Download the page source; Using disposes the WebClient even if the download fails
    Using client As New Net.WebClient
        doc.LoadHtml(client.DownloadString("https://example.com"))
    End Using

    Debug.Print(doc.DocumentNode.Name)
    PrintChildNodes(doc.DocumentNode)

    Debug.Print(doc.DocumentNode.Element("html").Element("body").InnerText)
End Sub

Sub PrintChildNodes(Node As HtmlAgilityPack.HtmlNode, Optional Indent As Integer = 1)
    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes
        ' Indent each node name by its depth (PadLeft expects a Char, so build the tab run directly)
        Debug.Print("{0}{1}", New String(ControlChars.Tab, Indent), Child.Name)
        PrintChildNodes(Child, Indent + 1)
    Next
End Sub
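
One caveat with reading InnerText from the body: it can include the contents of script and style elements, and HTML entities stay encoded. Below is a minimal sketch of one way to deal with that, assuming HtmlAgilityPack is referenced. The helper name GetVisibleText and the sample HTML string are made up for illustration and are not part of the library.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module PageText
    ' Hypothetical helper: returns the readable text of an HTML string.
    Function GetVisibleText(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when nothing matches, so check first.
        Dim hidden = doc.DocumentNode.SelectNodes("//script|//style")
        If hidden IsNot Nothing Then
            ' Iterate over a copy so removing nodes is safe.
            For Each node As HtmlNode In hidden.ToList()
                node.Remove()
            Next
        End If

        ' Fall back to the whole document if there is no <body> element.
        Dim body As HtmlNode = If(doc.DocumentNode.SelectSingleNode("//body"), doc.DocumentNode)

        ' DeEntitize turns entities such as &amp; back into characters.
        Return HtmlEntity.DeEntitize(body.InnerText).Trim()
    End Function

    Sub Main()
        ' Sample input (an assumption for the demo); in practice pass the downloaded page source.
        Dim sample As String = "<html><body><p>Tom &amp; Jerry</p><script>var x = 1;</script></body></html>"
        Console.WriteLine(GetVisibleText(sample))
    End Sub
End Module
```

To use it with a live page, pass the string returned by DownloadString to GetVisibleText instead of the sample.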
MrGadget
0

**Taken from** Wolfwyrd's answer to the question [HTTP GET in VB.NET](http://stackoverflow.com/questions/92522/http-get-in-vb-net):

Try
    Dim targetURI As New Uri("http://whatever.you.want.to.get/file.html")
    Dim fr As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(targetURI), System.Net.HttpWebRequest)

    ' Send the request once and dispose the response and reader when done
    Using resp As System.Net.WebResponse = fr.GetResponse()
        Using str As New System.IO.StreamReader(resp.GetResponseStream())
            Dim html As String = str.ReadToEnd()
            Debug.Print(html)
        End Using
    End Using
Catch ex As System.Net.WebException
    ' Error in accessing the resource, handle it
End Try

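This gives you the page's HTML, not its plain text, so you still have to strip the markup. The response stream contains only the body; the HTTP headers are available separately through the response's Headers property. HttpWebRequest also handles https URLs, although older .NET Framework versions may need a newer TLS version enabled through ServicePointManager.SecurityProtocol.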
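
For reference, a minimal sketch of the same kind of request against an https URL, assuming .NET Framework 4.5 or later (where SecurityProtocolType.Tls12 exists). The module name and the URL are placeholders, and printing the Content-Type header is only there to show that headers are read separately from the body.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module HttpsFetch
    Sub Main()
        ' Assumption: .NET Framework 4.5+; older defaults may not negotiate TLS 1.2 on their own.
        ServicePointManager.SecurityProtocol = ServicePointManager.SecurityProtocol Or SecurityProtocolType.Tls12

        Dim request = DirectCast(WebRequest.Create("https://example.com"), HttpWebRequest)

        Using response = DirectCast(request.GetResponse(), HttpWebResponse)
            ' Headers live on the response object, not in the body stream.
            Console.WriteLine(response.Headers("Content-Type"))

            Using reader As New StreamReader(response.GetResponseStream())
                ' The stream holds only the HTML body.
                Dim html As String = reader.ReadToEnd()
                Console.WriteLine(html.Length)
            End Using
        End Using
    End Sub
End Module
```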

marshal craft