I've written a website crawler in C using libcurl that can scrape text content from any public website.
However, what we actually need is to crawl password-protected websites, such as large news publishers, for which we hold valid subscriptions, so we have a username and password for each of these sites.
Can anybody offer advice on achieving this with libcurl? I'm aware you can supply the username/password via libcurl options, and I thought that doing this and simply requesting the password-protected page would be all there is to it. Here's an excerpt of the curl code:
curl_easy_setopt(curlTestHandle, CURLOPT_URL, "mypasswordprotectedwebsiteurl");
curl_easy_setopt(curlTestHandle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);
curl_easy_setopt(curlTestHandle, CURLOPT_FOLLOWLOCATION, 1L); /* these options take a long */
curl_easy_setopt(curlTestHandle, CURLOPT_MAXREDIRS, 5L);
curl_easy_setopt(curlTestHandle, CURLOPT_USERPWD, "myusername:mypassword");
res = curl_easy_perform(curlTestHandle);
curl_easy_getinfo(curlTestHandle, CURLINFO_RESPONSE_CODE, &httpResponse); /* httpResponse is a long */
However, perhaps I'm oversimplifying it? And perhaps this works with some websites but not others? Has anybody achieved something similar?
Thanks,
Manoj