
I've been trying to crawl a website at mystore411.com using the open-source crawler crawler4j.

The crawler works fine for a limited period of time (say 20-30 seconds) and then the website bans my IP address for a few minutes before I can crawl again. I couldn't figure out a possible solution.

I went through its robots.txt and here is what I found:

User-agent: Mediapartners-Google 
Disallow:

User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /images/

User-agent: Slurp
Crawl-delay: 1

User-agent: Baiduspider
Crawl-delay: 1

User-agent: MaxPointCrawler
Disallow: /

User-agent: YandexBot
Disallow: /

Please suggest if there is any alternative.

1 Answer

I can't tell you the exact reason why they banned you, but I can tell you some common reasons why an IP gets banned.

1) Your politeness delay in your CrawlController code may be too low.

  * Explanation:- The politeness delay is the time you set as the gap between two
                  consecutive requests. The more you reduce the delay, the more
                  requests will be sent to the server, increasing the server's
                  workload. So keep an appropriate politeness delay (the default
                  is 250 ms; set it with config.setPolitenessDelay(250);).
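
For example, a minimal sketch of raising the delay, where config is the CrawlConfig you pass to your CrawlController (the 1000 ms value is just an illustrative choice, not something from your setup):

    CrawlConfig config = new CrawlConfig();
    // Wait at least one second between consecutive requests;
    // a larger gap lowers the server load and the chance of a ban.
    config.setPolitenessDelay(1000);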

2) Reduce the no. of crawler threads.

 * Explanation:- Almost the same reason as above: more threads means more
                 concurrent requests hitting the same server.
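
The thread count is the second argument of CrawlController.start. A sketch, where MyCrawler stands in for your own WebCrawler subclass:

    // Fewer threads means fewer concurrent requests to the server.
    int numberOfCrawlers = 1;
    controller.start(MyCrawler.class, numberOfCrawlers);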

3) Don't crawl through robots.txt.

 * Explanation:- Disable robots.txt handling so that you don't get blocked by
                 the domain's robots.txt rules. In crawler4j that is
                 robotstxtConfig.setEnabled(false); (note that
                 config.setResumableCrawling(false); controls crash recovery,
                 not robots.txt handling).
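
A sketch of where that flag lives in the usual crawler4j setup (config and pageFetcher as in the standard examples):

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false); // skip robots.txt checks entirely
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);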

4) Try to use a good user agent.

  * Explanation:- https://en.wikipedia.org/wiki/User_agent
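
Putting the four points together, here is a minimal self-contained sketch (the storage folder, user-agent string, and MyCrawler class are placeholder names I made up, not something from your question):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class PoliteCrawl {

        // Placeholder crawler that just logs each visited URL.
        public static class MyCrawler extends WebCrawler {
            @Override
            public void visit(Page page) {
                System.out.println("Visited: " + page.getWebURL().getURL());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j");  // placeholder path
            config.setPolitenessDelay(1000);                 // 1) generous delay
            config.setUserAgentString(
                    "mycrawler/1.0 (+http://example.com/bot)"); // 4) honest UA

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            robotstxtConfig.setEnabled(false);               // 3) skip robots.txt
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(robotstxtConfig, pageFetcher);

            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.mystore411.com/");
            controller.start(MyCrawler.class, 1);            // 2) one thread
        }
    }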