3

How to screen scrape HTTPS using C#?

Charles Stewart
Jignesh

5 Answers

5

You can use System.Net.WebClient to open an HTTPS connection and download the page you want to scrape.
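As a rough sketch of that approach: the snippet below downloads a page over HTTPS with `WebClient.DownloadString` and pulls out its `<title>`. The URL `https://example.com/` and the `ExtractTitle` helper are placeholders invented for illustration, and the regex is only adequate for trivial cases, not general HTML parsing.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class HttpsScrape
{
    // Hypothetical helper: returns the text of the first <title> element, or null if none.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value).Trim() : null;
    }

    static void Main()
    {
        // Assumption: older .NET Framework versions may need TLS 1.2 switched on explicitly.
        ServicePointManager.SecurityProtocol |= SecurityProtocolType.Tls12;

        using (var client = new WebClient())
        {
            // Placeholder URL; substitute the page you actually want to scrape.
            string html = client.DownloadString("https://example.com/");
            Console.WriteLine(ExtractTitle(html));
        }
    }
}
```

WebClient handles the TLS handshake itself, so an `https://` URL needs no extra code beyond the protocol setting above.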

Brett Allen
5

Look into the Html Agility Pack.
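For a feel of what that looks like, here is a minimal sketch that lists the `href` of every link in a piece of HTML. It assumes the `HtmlAgilityPack` NuGet package is installed; the `ExtractLinks` helper and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;   // assumption: NuGet package "HtmlAgilityPack" is referenced

static class LinkScraper
{
    // Hypothetical helper: collects the href attribute of every <a> that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode a in anchors)
            links.Add(a.GetAttributeValue("href", string.Empty));
        return links;
    }

    static void Main()
    {
        // Sample markup; in practice this would be the downloaded page source.
        string html = "<p><a href=\"/one\">One</a> <a href='/two'>Two</a> <a>no href</a></p>";
        foreach (string href in ExtractLinks(html))
            Console.WriteLine(href);
    }
}
```

Querying with XPath tends to hold up better than string searching when the page layout shifts slightly.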

RichardOD
4

You can use System.Net.WebClient to grab web pages. Here is an example: http://www.codersource.net/csharp_screen_scraping.html

zfedoran
  • 2
    Link is dead; I think this may be the updated link: http://www.codersource.net/microsoft-net/c-advanced/html-screen-scraping-in-c.aspx – Simon_Weaver Oct 20 '10 at 22:03
2

If you have trouble fetching the page with a web client, or you want the request to look like it comes from a browser, you can host the WebBrowser control in an app, load the page in it, and read the source of the loaded document from the control.
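One possible shape for this, as a sketch only: it assumes Windows, a reference to System.Windows.Forms, and a placeholder URL. The control needs an STA thread and a message loop, which is why `Application.Run()` appears even though no form is shown.

```csharp
using System;
using System.Windows.Forms;

static class BrowserScrape
{
    [STAThread]   // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // The event can fire once per frame; wait until the whole page is ready.
            if (browser.ReadyState != WebBrowserReadyState.Complete) return;

            string html = browser.DocumentText;   // source of the loaded page
            Console.WriteLine(html.Length + " characters loaded");
            Application.ExitThread();             // stop the message loop
        };

        // Placeholder URL; substitute the page you actually want to scrape.
        browser.Navigate("https://example.com/");
        Application.Run();                        // pump messages until ExitThread
    }
}
```

This is much heavier than WebClient, so it is probably only worth it when the site refuses plain HTTP clients.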

Cyril Gupta
1

Here's a concrete (albeit trivial) example. You can pass a ship name to VesselFinder in the query string, but even when exactly one ship matches, the site still shows the search results page with that one ship. This example detects that case and takes the user straight to the tracking map for the ship.

        // Requires: using System; using System.Net; using System.Text; using System.Web;
        string strName = "SAFMARINE MAFADI";
        string strURL = "https://www.vesselfinder.com/vessels?name=" + HttpUtility.UrlEncode(strName);
        string strReturnURL = strURL;   // default: fall back to the name search results
        string strToSearch = "/?imo=";
        string strPage;

        using (WebClient objWebClient = new WebClient())
        {
            objWebClient.Headers.Add("User-Agent", "Other");        // set a User-Agent, or the HTTPS request fails
            byte[] aReqtHTML = objWebClient.DownloadData(strURL);   // do the name search
            strPage = Encoding.UTF8.GetString(aReqtHTML);           // decode the response bytes as UTF-8
        }

        int iFirst = strPage.IndexOf(strToSearch, StringComparison.Ordinal);
        int iLast = strPage.LastIndexOf(strToSearch, StringComparison.Ordinal);

        if (iFirst >= 0 && iFirst == iLast)
        {
            // exactly one instance found, so extract the ship's IMO link
            string strLink = strPage.Substring(iFirst);             // cut off the stuff before
            int iQuote = strLink.IndexOf('"');
            if (iQuote > 0)
            {
                strLink = strLink.Substring(0, iQuote);             // cut off the stuff after
                strReturnURL = "https://www.vesselfinder.com" + strLink;
            }
        }
        // zero or several instances found: strReturnURL stays as the name search

SteveCav