I've written a website crawler in C using libcurl that can scrape text content from any public website.
However, what we actually need is to crawl password-protected websites, such as large news publishers, for which we hold valid subscriptions, so we have a username and password for each of these sites.
Can anybody offer advice on achieving this with libcurl? I'm aware you can supply the username/password via libcurl options, and I thought that doing this and simply requesting the password-protected page would be all there is to it. Here's an excerpt of the curl code:
curl_easy_setopt(curlTestHandle, CURLOPT_URL, "mypasswordprotectedwebsiteurl");
curl_easy_setopt(curlTestHandle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);
curl_easy_setopt(curlTestHandle, CURLOPT_FOLLOWLOCATION, 1L); /* these options take a long */
curl_easy_setopt(curlTestHandle, CURLOPT_MAXREDIRS, 5L);
curl_easy_setopt(curlTestHandle, CURLOPT_USERPWD, "myusername:mypassword");
res = curl_easy_perform(curlTestHandle);
curl_easy_getinfo(curlTestHandle, CURLINFO_RESPONSE_CODE, &httpResponse); /* httpResponse is a long */
However, perhaps I'm oversimplifying it? And perhaps this works with some websites but not others? Has anybody achieved something similar?
Thanks,
Manoj