  ' Requires: Imports System.IO, System.Net, System.Text, System.Text.RegularExpressions
  Protected Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim imageLink As String = ""
        Dim url As String = TextBox1.Text
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
        Dim html As String

        ' Using blocks close the response and the reader even if reading fails.
        Using respons As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            ' Use the charset from the response header; fall back to ISO-8859-1
            ' when it is missing ("") or not a recognized encoding name.
            Dim enc As Encoding
            Try
                enc = Encoding.GetEncoding(respons.CharacterSet)
            Catch ex As ArgumentException
                enc = Encoding.GetEncoding("ISO-8859-1")
            End Try

            Using reader As New StreamReader(respons.GetResponseStream(), enc)
                html = reader.ReadToEnd()
            End Using
        End Using

        ' Capture the src attribute of every <img> tag in the named group SRC.
        Dim pattern As String = "<img([^s]|s[^r]|sr[^c]|src[^=]|src=[^'""])*src=['""](?<SRC>[^'""]*)['""]"
        For Each mm As Match In Regex.Matches(html, pattern)
            Dim link_ As String = mm.Groups("SRC").Value

            ' StartsWith does not throw on values shorter than 7 characters,
            ' unlike Substring(0, 7).
            If link_.StartsWith("http://") Then
                Response.Write(link_ + "<br>")
                imageLink = link_
            End If
        Next

        ' Show the last absolute image link that was found.
        Dim image_ As New Image
        image_.Attributes("src") = imageLink
        PlaceHolder1.Controls.Add(image_)
    End Sub

This is the code I use to send a request to a web page, read its content, and extract the image links from it. However, for some pages the response header does not include a charset; it returns "". When I then try to decode the content with the default encoding, it does not give the proper content either. This is really frustrating. Has anyone come across this type of situation before? If anyone could point me in the right direction on how to overcome this, or how to predict which encoding to use, that would be great. Thanks.

Example of a site that gives no charset in the response header:

Image site link

Dasith
  • If there are no HTTP headers set, the default is ISO-8859-1, but some web sites set the charset in the meta tag inside the HTML. You could check that too. If that isn't set either, then it's a bad site. Also, your regexp probably could be simplified by using non-greedy ` – Sami Kuhmonen May 13 '14 at 09:58
  • None of the encodings work for that site. I checked with the default iso-8859-1 and then with utf-8, which is what their content declares, and neither worked. Is there any other way to determine it? It can't be that bad of a site; it was in the top 10 for image hosting sites. – Dasith May 13 '14 at 10:02
  • If you can tell the site, then others can see the situation better. Otherwise it's quite impossible to say. – Sami Kuhmonen May 13 '14 at 10:03
  • I have it posted in the question itself, at the bottom. – Dasith May 13 '14 at 10:05
  • Sorry, I thought it was a screenshot of the actual site you wanted to get content from. – Sami Kuhmonen May 13 '14 at 10:07

1 Answer

In this case the character encoding is given in the headers: Content-Type: text/html; charset=utf-8
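
For pages where the header charset is missing or unusable, a minimal sketch of the fallback order mentioned in the comments (header charset, then a charset declared in a `<meta>` tag, then ISO-8859-1) might look like the following. `DetectEncoding` and `TryGetEncoding` are hypothetical helper names, not part of the original code, and the 4096-byte scan window is an arbitrary assumption.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetSketch
    ' Hypothetical helper: choose an encoding for a downloaded page from the
    ' header charset, then a <meta> charset, then ISO-8859-1.
    Function DetectEncoding(headerCharset As String, body As Byte()) As Encoding
        Dim fallback As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' 1. Charset from the Content-Type header, if present and recognized.
        Dim fromHeader As Encoding = TryGetEncoding(headerCharset)
        If fromHeader IsNot Nothing Then Return fromHeader

        ' 2. Decode the first bytes as ISO-8859-1 (every byte maps to one char)
        '    and look for charset=... inside a <meta> tag.
        Dim head As String = fallback.GetString(body, 0, Math.Min(body.Length, 4096))
        Dim m As Match = Regex.Match(head,
            "<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase)
        If m.Success Then
            Dim fromMeta As Encoding = TryGetEncoding(m.Groups(1).Value)
            If fromMeta IsNot Nothing Then Return fromMeta
        End If

        ' 3. Nothing usable was declared.
        Return fallback
    End Function

    ' Returns Nothing instead of throwing for empty or unknown encoding names.
    Private Function TryGetEncoding(name As String) As Encoding
        If String.IsNullOrWhiteSpace(name) Then Return Nothing
        Try
            Return Encoding.GetEncoding(name.Trim())
        Catch ex As ArgumentException
            Return Nothing
        End Try
    End Function
End Module
```

One way to use it would be to copy `respons.GetResponseStream()` into a `MemoryStream`, pass `respons.CharacterSet` and the resulting byte array to `DetectEncoding`, and decode the bytes with the returned encoding's `GetString`. This only covers pages that declare their charset somewhere; it does not guess an encoding from undeclared content.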

Sami Kuhmonen
  • Yes, that is the Content-Type header, but it is not the encoding of the content on the site; when I decode it as utf-8 it returns gibberish text. – Dasith May 13 '14 at 10:12