
I've got libcurl fetching a page's source from the web, then going through it and picking out the data I need.

Everything is working great bar one page. I had the same problem during offline testing, using ifstream with the page source saved to a .html file. Basically, what I think is happening is that the page renders the HTML and then fills in the data I want through JS calls (not 100% sure of this), so it isn't directly present in the source.

How I got around this in offline testing was to download the full web page as an offline file in Safari (a .webarchive file, I believe). That way, when I viewed it as source code, both the HTML and the data were present.

I've trawled the internet for an answer but can't seem to find one. Can anyone point me to a curl setting that downloads the web page in its "fullness"?

Here are the options I currently use:

curl_easy_setopt(this->curl, CURLOPT_URL, url);
curl_easy_setopt(this->curl, CURLOPT_FOLLOWLOCATION, 1L);          // follow redirects (long value expected)
curl_easy_setopt(this->curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0");
curl_easy_setopt(this->curl, CURLOPT_COOKIEFILE, "cookies.txt");   // read cookies from here
curl_easy_setopt(this->curl, CURLOPT_COOKIEJAR, "cookies.txt");    // write cookies back here
curl_easy_setopt(this->curl, CURLOPT_POSTFIELDS, postData);        // only set when POSTing
curl_easy_setopt(this->curl, CURLOPT_WRITEFUNCTION, this->WriteCallback);
curl_easy_setopt(this->curl, CURLOPT_WRITEDATA, &readBuffer);
res = curl_easy_perform(this->curl);
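
(WriteCallback and readBuffer aren't shown above; for context, a minimal sketch of the usual shape of such a callback, assuming readBuffer is a std::string, would be something like the following. This is an illustration, not the actual implementation.)

#include <string>

// Sketch only: libcurl calls this with chunks of the response body. It must be a
// static member function (or a free function) to be usable as a C callback.
static size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userp)
{
    std::string* buffer = static_cast<std::string*>(userp); // &readBuffer from CURLOPT_WRITEDATA
    buffer->append(contents, size * nmemb);                 // accumulate the page source
    return size * nmemb;                                    // tell libcurl everything was consumed
}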
Makka

1 Answer


You would have to parse the HTML and download every single hypertext reference in the document.

When Safari downloads the web page, it dumps everything related to that page that is actively cached into a .webarchive, which contains local references for all of the images, CSS, and JS files. In other words, it gives you the page in its loaded form, with all of the resources inside the archive, so it differs from the actual source.

You could do a string search for href= and src= (after stripping the spaces in the document, so variants like href = still match) and get the URLs for most of them that way.
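
For illustration, a minimal sketch of that kind of string search might look like this (extractLinks is a made-up name for this example, and it assumes the attribute values are double-quoted):

#include <string>
#include <vector>

// Sketch only: scan fetched HTML for href="..." and src="..." attribute values.
std::vector<std::string> extractLinks(const std::string& html)
{
    std::vector<std::string> links;
    const char* attrs[] = { "href=\"", "src=\"" };
    for (const char* attr : attrs) {
        std::string needle(attr);
        std::size_t pos = 0;
        while ((pos = html.find(needle, pos)) != std::string::npos) {
            std::size_t start = pos + needle.size();
            std::size_t end = html.find('"', start);   // closing quote of the attribute value
            if (end == std::string::npos) break;
            links.push_back(html.substr(start, end - start));
            pos = end + 1;
        }
    }
    return links;
}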

Some href and src attributes will contain relative links, not absolute ones, so be sure to check whether the value begins with http://; otherwise you'd have to take the base path from your url variable and concatenate the strings.
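
A rough sketch of that check (again with made-up names, assuming base is the page URL trimmed to its last / and well-formed):

#include <string>

// Sketch only: prepend the page's base URL to relative links.
std::string resolveLink(const std::string& base, const std::string& link)
{
    if (link.compare(0, 7, "http://") == 0 || link.compare(0, 8, "https://") == 0)
        return link;                                   // already absolute
    if (!link.empty() && link[0] == '/') {             // root-relative: keep scheme + host from base
        std::size_t hostStart = base.find("://") + 3;
        std::size_t hostEnd = base.find('/', hostStart);
        return base.substr(0, hostEnd) + link;
    }
    return base + link;                                // plain relative path, append to base directory
}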

The only problem with this is content that is loaded dynamically through JavaScript or CSS (which you mentioned in passing). That will make things difficult, because you'd also have to dig through those files for references to the content.

Good luck!

Ryan Willis
  • Not only parsing the JS, but you might also have to actually execute it to see how it manipulates the content of the web page, especially if it is using DOM interfaces to do it. So yeah, getting the "full" source using libcurl alone is not enough, since all it will receive is the HTML's static content, not its dynamic content. – Remy Lebeau Oct 02 '13 at 02:29
  • When using Firefox's Inspect Element I can see the content fine. I've tracked it down and it seems to be JS: there are two div tags, the first is shown while the page is loading and the second once it has loaded. Also, the JS is automatically called based on GET variables sent in the URL, e.g. page.php?a=1&b=2; it just takes a few seconds after loading to show.
    – Makka Oct 02 '13 at 02:42
  • Ah, didn't think about AJAX. When you use `libcurl` you're essentially using HTTP GET. Tracking down the URL of the content that is being queried can be done by parsing the JS. Even then, you'd have to figure out whether the AJAX was done via jQuery, standard JavaScript, or another toolkit, and then insert that separate query into the originally returned HTML (see the sketch after these comments). – Ryan Willis Oct 02 '13 at 02:52
  • How would one do that, since the content is only accessible within the session? – Makka Oct 02 '13 at 03:00
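
For illustration, a minimal sketch of that second request, reusing the easy handle configured in the question so the cookies.txt session is carried over. The endpoint URL here is an assumption (page.php?a=1&b=2 from the comments), standing in for whatever URL the JS actually calls:

// Sketch only: reuse the already-configured easy handle so the session cookies
// stored via CURLOPT_COOKIEFILE/CURLOPT_COOKIEJAR are sent with the second request.
std::string ajaxBuffer;
curl_easy_setopt(this->curl, CURLOPT_URL, "http://example.com/page.php?a=1&b=2"); // assumed AJAX endpoint
curl_easy_setopt(this->curl, CURLOPT_HTTPGET, 1L);            // plain GET, clears any previous POST
curl_easy_setopt(this->curl, CURLOPT_WRITEDATA, &ajaxBuffer); // collect the dynamic data separately
res = curl_easy_perform(this->curl);                          // ajaxBuffer now holds the queried content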