
I am trying to get weather data from BOM Australia. The manual way is to go to http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=2064 and click 'All years of data', which downloads the file.

Here's what I have tried to automate this:

using (WebClient client = new WebClient())
{
    string html = client.DownloadString("http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=2064");

    // Find the link to the zipped daily data file in the returned html.
    List<string> list = LinkExtractor.Extract(html);
    foreach (var link in list)
    {
        if (link.StartsWith("/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile"))
        {
            string resource = "http://www.bom.gov.au" + link;
            MessageBox.Show(resource);

            // Download to the path stored in the SSIS connection manager.
            client.DownloadFileAsync(new Uri(resource), Dts.Connections["data.zip"].ConnectionString);
            break;
        }
    }
}

Don't worry about the LinkExtractor; it works, as I am able to see the link that serves the file. The problem is that 'DownloadFileAsync' creates a new request, so the file never downloads, because the download needs the same session as the first request.

Is there a way I can do this? Please ask if you need more clarification.
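
One suggestion I received (see the comments below) is a cookie-aware WebClient, where every request the client makes goes through a single shared CookieContainer. A minimal sketch of that idea (the implementation here is an assumption, untested against BOM; requires using System and System.Net):

public class CookieAwareWebClient : WebClient
{
    // Shared across all requests this client makes, so a session cookie
    // set by the first GET is replayed on the download request.
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.CookieContainer = cookies;
        }
        return request;
    }
}

Replacing the plain WebClient above with this class would make both the DownloadString call and the subsequent download share one session.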

UPDATE:

Here are the changes I made, utilising cookies from HttpWebRequest. However, I am still not able to download the file.

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=2064");
request.CookieContainer = new CookieContainer();

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// Show any cookies the first response set.
foreach (Cookie cook in response.Cookies)
{
    MessageBox.Show(cook.ToString());
}

if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;

    if (response.CharacterSet == null)
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
    }

    string data = readStream.ReadToEnd();

    using (WebClient client = new WebClient())
    {
        // Copy the first response's cookies onto the WebClient's headers.
        foreach (Cookie cook in response.Cookies)
        {
            MessageBox.Show(cook.ToString());
            client.Headers.Add(HttpRequestHeader.Cookie, cook.ToString());
        }

        List<string> list = LinkExtractor.Extract(data);
        foreach (var link in list)
        {
            if (link.StartsWith("/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile"))
            {
                string initial = "http://www.bom.gov.au" + link;
                MessageBox.Show(initial);

                //client.Headers.Add(HttpRequestHeader.Cookie, "JSESSIONID=2EBAFF7EFE2EEFE8140118CE5170B8F6");
                client.DownloadFile(new Uri(initial), Dts.Connections["data.zip"].ConnectionString);
                break;
            }
        }
    }

    response.Close();
    readStream.Close();
}
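
For comparison, a sketch that keeps everything on HttpWebRequest and shares one CookieContainer between both requests (this assumes a session cookie is all the zip request needs; 'pageUrl' stands for the long URL above and 'initial' for the extracted zip link; requires using System.IO and System.Net):

CookieContainer cookies = new CookieContainer();

HttpWebRequest pageRequest = (HttpWebRequest)WebRequest.Create(pageUrl);
pageRequest.CookieContainer = cookies; // the first response fills the container
// ... read the html and extract 'initial' exactly as above ...

HttpWebRequest zipRequest = (HttpWebRequest)WebRequest.Create(initial);
zipRequest.CookieContainer = cookies;  // same container, same session

using (HttpWebResponse zipResponse = (HttpWebResponse)zipRequest.GetResponse())
using (Stream body = zipResponse.GetResponseStream())
using (FileStream file = File.Create(Dts.Connections["data.zip"].ConnectionString))
{
    body.CopyTo(file); // stream the zip straight to disk
}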
  • Can you please elaborate on how using cookies would help, since no user credentials are required for browsing the website? – Vikas Dhochak Aug 11 '16 at 06:31
  • Because some sites care about their content and take measures to prevent easy scraping. Some might require a session cookie, some generate unique URLs on each GET, some need a referrer, some run JavaScript and do a couple of AJAX requests. If you are able to download the file successfully with a browser, you only have to mimic that. The WebClient is not going to do that on its own. Use the developer console of your browser to figure out what is needed in subsequent HTTP calls (see the sketch after this thread). – rene Aug 11 '16 at 06:52
  • The console shows this when I click to download the file: Resource interpreted as Document but transferred with MIME type application/zip: "http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=2064&p_c=-938623&p_nccObsCode=136&p_startYear=2016". – Vikas Dhochak Aug 11 '16 at 06:57
  • You need to look at the network tab and study the request and response headers ... – rene Aug 11 '16 at 07:01
  • I am able to see the request cookie. How do I set the request cookie from the first link as the request cookie for the second link? The 'CookieAwareWebClient' example is not working for me. – Vikas Dhochak Aug 11 '16 at 09:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/120685/discussion-between-vikas-dhochak-and-rene). – Vikas Dhochak Aug 11 '16 at 10:29
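
As rene's comments suggest, mimicking the browser can come down to sending the headers a browser would send. A minimal sketch of that idea (the header values and 'zipUrl' are illustrative assumptions, not known requirements of the site):

using (var client = new WebClient())
{
    // Illustrative headers only; check the browser's network tab for what
    // the site actually sends.
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
    client.Headers[HttpRequestHeader.Referer] = "http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=2064";

    // 'zipUrl' is a placeholder for the extracted download link.
    client.DownloadFile(zipUrl, @"c:\temp\data.zip");
}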

1 Answer


The HTML you get, and the URLs within it, are HTML-encoded. That means that when you substring a URL out of the HTML, you need to decode it before using it. This is what the download URL for the zip looks like in the raw HTML:

   /jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&amp;p_stn_num=2064&amp;p_c=-938623&amp;p_nccObsCode=136&amp;p_startYear=2016

There is a helper class in System.Net to do the decoding for us: WebUtility

This code does download the zip file:

using (var client = new WebClient())
{
    var url = "http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=2064";
    string html = client.DownloadString(url);

    // Cut the zip link out of the raw html; it ends at the closing quote.
    var pos = html.IndexOf("/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile");
    var endpos = html.IndexOf('"', pos);
    string link = html.Substring(pos, endpos - pos);

    // The link is html-encoded (&amp;), so decode it before requesting it.
    var decodedLink = WebUtility.HtmlDecode(link);
    string resource = "http://www.bom.gov.au" + decodedLink;

    client.DownloadFile(new Uri(resource), @"c:\temp\bom2.zip");
}

In this case you don't need the cookies to be kept, but you do need to be careful with the URLs you parse.

rene