I am using crawler4j in a fairly amateur setting to crawl articles from a site (and boilerpipe for content extraction). On some sites the crawler works very neatly, but on others it simply fails to fetch pages (even though I can still get the same data with jsoup).
Within the same site, some pages are fetched while others aren't. For the failing pages it logs this warning and then skips the page altogether:
Feb 11, 2016 5:05:31 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: visid_incap_688991=7KCcJ/TxTWSEzP9k6OFX2eZqvFYAAAAAQUIPAAAAAAAHVw5Tx4mHCf3VQHK63tAN; expires=Fri, 09 Feb 2018 15:00:14 GMT; path=/; Domain=.banglatribune.com". Invalid 'expires' attribute: Fri, 09 Feb 2018 15:00:14 GMT
From this warning I gather that crawler4j's cookie handling is the problem: it uses CookieSpecs.DEFAULT, and I haven't found any way to change that.
Is there a way to supply my own HttpClient instead of the one crawler4j creates?
Is there any way to change the cookie options in crawler4j?
PageFetcher.java in crawler4j is where the HttpClient is created and all the cookie options are set.
Or should I switch to another crawler that can be customized for sites that send cookies in a bad format?
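For what it's worth, here is a sketch of the direction I have been considering: subclassing PageFetcher and swapping in an HttpClient configured with CookieSpecs.STANDARD (the RFC 6265 parser, which accepts more "expires" date formats than DEFAULT). This assumes the httpClient field in PageFetcher is protected in the crawler4j version in use; the class name LenientCookiePageFetcher is my own, not part of the library.

```java
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.HttpClients;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;

// Hypothetical subclass: replaces the client built by PageFetcher's
// constructor with one that uses the more lenient STANDARD cookie spec.
public class LenientCookiePageFetcher extends PageFetcher {

    public LenientCookiePageFetcher(CrawlConfig config) {
        super(config);
        RequestConfig requestConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.STANDARD) // instead of DEFAULT
                .setSocketTimeout(config.getSocketTimeout())
                .setConnectTimeout(config.getConnectionTimeout())
                .build();
        // Assumes httpClient is a protected field; if it is private in your
        // crawler4j version, copy PageFetcher into your project and edit it.
        this.httpClient = HttpClients.custom()
                .setDefaultRequestConfig(requestConfig)
                .setUserAgent(config.getUserAgentString())
                .build();
    }
}
```

The fetcher could then be handed to the controller in place of the default one, e.g. `new CrawlController(config, new LenientCookiePageFetcher(config), robotstxtServer)` — but I am not sure whether this is the intended way to do it, hence the question.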
Any help will be very much appreciated.