
I have a site whose content is in French.

I want to download its pages using HttpWebRequest and HttpWebResponse in a C# console application.

public string GetContents(string url)
{
    StreamReader _Answer;
    try
    {
        HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create(url);
        WebReq.Headers.Add(HttpRequestHeader.AcceptEncoding, "utf-8");
        WebReq.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0;Windows NT 5.1;)";
        WebReq.ContentType = "application/x-www-form-urlencoded";
        HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse();
        Stream Answer = WebResp.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        _Answer = new StreamReader(Answer, Encoding.UTF8);
        return _Answer.ReadToEnd();
    }
    catch
    {
    }
    return "";
}

I get the content, but it contains strange symbols such as squares.

jgauffin
Aamir

1 Answer


Are you sure the web server is responding with UTF-8 encoding?

Update:

The web server from which you are trying to download is serving the pages with a character encoding of ISO-8859-1 and not UTF-8.

You have to either (a) change your hard-coded encoding from UTF-8 to ISO-8859-1, or (b) read the character set from the server's response and use that.
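Option (b) could look roughly like the sketch below. It uses the `CharacterSet` property of `HttpWebResponse` to pick the `Encoding` for the `StreamReader` instead of assuming UTF-8. The class name `PageDownloader` and the helper `ResolveEncoding` are made up for illustration, and falling back to ISO-8859-1 when the header gives no usable charset is an assumption, not something the server guarantees.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageDownloader
{
    // Hypothetical helper: maps the charset named in the Content-Type header
    // to an Encoding. The ISO-8859-1 fallback is an assumption for the case
    // where the charset is missing or not recognised.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.GetEncoding("iso-8859-1");
        try
        {
            // Some servers wrap the charset value in quotes; strip them.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.GetEncoding("iso-8859-1");
        }
    }

    public static string GetContents(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        using (StreamReader reader =
            new StreamReader(body, ResolveEncoding(response.CharacterSet)))
        {
            // Decode the bytes with the encoding the server announced.
            return reader.ReadToEnd();
        }
    }
}
```

For option (a), the only change to the original method would be passing `Encoding.GetEncoding("iso-8859-1")` to the `StreamReader` instead of `Encoding.UTF8`, which works only as long as the server keeps serving that charset.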

Albireo
  • Yes. When I view the source of the page, it specifies UTF-8 as the content type. Please see this URL: http://www.gites-de-france.com/location-vacances-bennwihr-gite--,gites68_b2011.1.68G3550.G.html – Aamir Jul 01 '11 at 07:32
  • That's irrelevant. Check the HTTP headers: the content-type [is set](http://i.imgur.com/U6TOm.png) to `text/html; charset=iso-8859-1`, not to `utf-8`. – Albireo Jul 01 '11 at 07:35
  • Yes, I checked the URL; the content type in it is text/html; charset=UTF-8 – Aamir Jul 01 '11 at 07:54
  • http://www.gites-de-france.com/location-vacances-bennwihr-gite--,gites68_b2011.1.68G3550.G.html – Aamir Jul 01 '11 at 07:59
  • Check what? The content type is still `text/html; charset=iso-8859-1`, so the character encoding is `iso-8859-1` and not `utf-8`. – Albireo Jul 01 '11 at 08:06