
I am using crawler4j in a very basic setup to crawl articles from a site (with boilerpipe for content extraction). On some sites the crawler works well, but on others it fails to fetch pages (though I can still get the data using jsoup).

Even within a single site, some pages are fetched and others are not. The crawler logs this warning and then skips the page altogether:

Feb 11, 2016 5:05:31 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: visid_incap_688991=7KCcJ/TxTWSEzP9k6OFX2eZqvFYAAAAAQUIPAAAAAAAHVw5Tx4mHCf3VQHK63tAN; expires=Fri, 09 Feb 2018 15:00:14 GMT; path=/; Domain=.banglatribune.com". Invalid 'expires' attribute: Fri, 09 Feb 2018 15:00:14 GMT

From this warning I understand that the problem lies in how crawler4j handles cookies (it uses CookieSpecs.DEFAULT, and I can't find any way to change that).
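To illustrate why the 'expires' value might be rejected, here is a minimal sketch using only the JDK. It assumes that the cookie spec in use falls back to the old Netscape-draft date pattern (`EEE, dd-MMM-yy HH:mm:ss z`), which is my reading of the HttpClient 4.x behaviour and not something the warning itself confirms. The class name `ExpiresPatternDemo` and the helper `parses` are made up for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class ExpiresPatternDemo {
    // Assumption: the date pattern of the Netscape cookie draft (dashes, two-digit year).
    static final String NETSCAPE_PATTERN = "EEE, dd-MMM-yy HH:mm:ss z";
    // RFC 1123 style pattern (spaces, four-digit year), matching the header in the warning.
    static final String RFC_1123_PATTERN = "EEE, dd MMM yyyy HH:mm:ss zzz";

    // Returns true if the value can be parsed with the given pattern.
    static boolean parses(String pattern, String value) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        try {
            format.parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The 'expires' value copied from the warning above.
        String expires = "Fri, 09 Feb 2018 15:00:14 GMT";
        System.out.println("Netscape draft pattern parses it: " + parses(NETSCAPE_PATTERN, expires));
        System.out.println("RFC 1123 pattern parses it:       " + parses(RFC_1123_PATTERN, expires));
    }
}
```

If that assumption holds, the date in the header is perfectly valid and only the expected format differs, which would explain why a more lenient client such as jsoup still gets the page.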

Is there any way to supply and manage my own HttpClient instead of the one crawler4j creates?

Is there any way to change the cookie options in crawler4j?

PageFetcher.java in crawler4j creates the httpclient and handles all the cookie options.

Or should I use another crawler that can be customized for sites that send badly formatted cookies?

Any help will be very much appreciated.

d1xlord
  • I'm struggling with authentication/cookies and crawler4j myself... As far as I can see, it's not designed to let you manage the HTTP client yourself. You could clone the repo, rewrite the PageFetcher to do so, and suggest the change to the crawler4j dev team. Unfortunately I'm not aware of any alternative that is useful for me. These might be useful alternatives for you: Nutch, or Scrapy (Python). If you find others, please leave a comment. – divadpoc Feb 18 '16 at 07:52
  • I am currently using the webmagic crawler. It lacks some basic configuration options (like crawl depth or a maximum number of pages to crawl), but it is very easy to add the features you want because its design is very nice. – d1xlord Feb 18 '16 at 08:52
  • Just a tip about webmagic: the full user guide is written in Chinese. You can always use Google Translate, though, and for me that's good enough to understand the underlying concepts. – d1xlord Feb 18 '16 at 08:54
  • I checked it out, and the design is pretty nice, agreed. I extended it to use authentication, and that's still not working, but despite that it's a pretty good project. Thanks for the hint. – divadpoc Feb 18 '16 at 11:21

1 Answer


The HttpClient is created and managed inside the crawler4j source, so there is no way to change or manipulate any of its configuration (including the cookie specification) when using this library.

d1xlord