I have made a web scraper for Google Scholar in Java with JSoup. The scraper searches Scholar for a DOI and finds the citations of that paper. This data is needed for a research project.
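For reference, this is roughly the core of the scraper. The CSS selectors (`div.gs_ri`, the `cites=` pattern in the "Cited by" link) are just what I picked up from inspecting the result page, so they may not be stable:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScholarScraper {

    // Search Scholar for a DOI and read the "Cited by" count of the first result.
    public static int citationCount(String doi) throws Exception {
        Document doc = Jsoup.connect("https://scholar.google.com/scholar")
                .data("q", doi)
                .userAgent("Mozilla/5.0")
                .timeout(10_000)
                .get();

        // First result item; the citation link contains "cites=" in its URL.
        Element citedBy = doc.selectFirst("div.gs_ri a[href*=cites=]");
        if (citedBy == null) {
            return 0; // no "Cited by" link found for this result
        }
        // The link text looks like "Cited by 123"; keep only the digits.
        return Integer.parseInt(citedBy.text().replaceAll("\\D+", ""));
    }
}
```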
But the scraper only works for the first few requests. After that it runs into a captcha on the Scholar site.
However, when I open the website in my browser (Chrome), Google Scholar loads normally.
How is this possible? All requests come from the same IP address! So far I have tried the following options:
- Choosing a random user agent for each request (from a list of 5 user agents)
- Adding a random delay of 5 to 50 seconds between requests
- Using a Tor proxy; however, almost all exit nodes have already been blocked by Google (a sketch of how I combine these options is below)
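The throttling part of my code looks roughly like this. The user-agent list is shortened here, and the commented-out Tor line assumes a local SOCKS proxy on the default port 9050:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class ThrottledFetcher {

    // Shortened list; the real scraper rotates over 5 user agents.
    private static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36");

    private static final Random RANDOM = new Random();

    // Sleep 5 to 50 seconds, then fetch with a randomly chosen user agent.
    public static Document fetch(String url) throws Exception {
        TimeUnit.SECONDS.sleep(5 + RANDOM.nextInt(46)); // 5..50 seconds
        String agent = USER_AGENTS.get(RANDOM.nextInt(USER_AGENTS.size()));
        return Jsoup.connect(url)
                .userAgent(agent)
                // Tor variant (assumes Tor's SOCKS proxy listening locally):
                // .proxy(new java.net.Proxy(java.net.Proxy.Type.SOCKS,
                //         new java.net.InetSocketAddress("127.0.0.1", 9050)))
                .get();
    }
}
```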
When I analyse the requests Chrome makes to Scholar, I see that a cookie with some session IDs is sent. This is probably why the Chrome requests are not blocked. Is it possible to use this cookie for requests made with JSoup?
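What I have in mind is something like the sketch below, where the cookie names and values (GSP and NID are just what I see in Chrome's DevTools) would be copied from the browser:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.Map;

public class CookieFetch {

    // Reuse the session cookies that Chrome sends to Scholar. The values below
    // are placeholders; the real ones would be copied from the
    // Application > Cookies tab in Chrome DevTools.
    public static Document fetchWithBrowserCookies(String url) throws Exception {
        Map<String, String> cookies = Map.of(
                "GSP", "<value copied from Chrome>",
                "NID", "<value copied from Chrome>");

        return Jsoup.connect(url)
                .cookies(cookies)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();
    }
}
```

Alternatively, I could do a first request via `Connection.Response` and carry `response.cookies()` over to the following requests, but I am not sure Scholar treats those the same way as the cookies set in the browser.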
Thank you!