
I'm a Java engineer with zero DevOps experience. Lately I've been playing around with an Ubuntu Linux server for the first time, used Docker with my Selenium project, and ran into this problem:

I'm trying to scrape HTML from a website, but my calls are getting blocked with a 403 Forbidden response. I tried to curl the same website and got the same response.

Furthermore, I only get blocked on my Linux server; everything works in my local dev environment with the same Docker image, which is why I think it's a "server fault".

Any ideas what my Linux server is missing here? Maybe I'm missing some sort of certificate, or have a CORS problem? Any ideas what I can try? (For learning purposes only.)

curl call here

  • Pass the web browser, your curl command, and your Java app through a proxy like mitmproxy and check the requests, especially the headers. I am sure you will see the differences that cause the web server to send different responses. (See the sketch after this comment thread.) – Robert Jan 31 '22 at 20:31
  • Not really on topic for Server Fault; getting Selenium and curl commands to work is more Stack Overflow. But most likely the site tries to detect scrapers and uses mechanisms like cookies and sessions to identify real interactive users/browsers. – Bob Jan 31 '22 at 20:36
  • @Bob I would say it is Server Fault material, because it works on my local machine with the same Docker image. – Vytautas Šerėnas Feb 01 '22 at 06:28
  • @Robert I appreciate your suggestion; I'm going to investigate and update this question. – Vytautas Šerėnas Feb 01 '22 at 06:30
  • Just being the server's fault doesn't make it on topic for Server Fault. If it's your server you are trying to scrape, provide your server configuration and log files and we can try to help. If it's not your server, it's off topic here, and in that case I'd stop doing what you are doing. Right now you are just getting a 403; the next notice might be from a lawyer. – Gerald Schneider Feb 04 '22 at 09:14
  • As I mentioned, I'm a total noob at this, and I can provide any config files you think could help. Basically, at this point, I don't know what I don't know. I had no idea this could be illegal, but I don't think a few calls a day could lead to those consequences; I don't have a server running and spamming calls. I'm definitely more cautious now and will do my research about this too. I'd also like to mention that my main purpose is to learn through practice, and I don't have any goal here other than understanding how I'm being recognized and blocked. Thanks – Vytautas Šerėnas Feb 04 '22 at 09:41
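
Following Robert's suggestion, here is a minimal sketch of routing a request through mitmproxy to inspect exactly what gets sent. The real URL was redacted from the question, so example.com stands in for it; port 8080 is mitmproxy's default listen port.

    # Terminal 1: start mitmproxy (listens on localhost:8080 by default)
    mitmproxy

    # Terminal 2: send the same request through the proxy.
    # -x routes curl through the proxy; -k (--insecure) accepts mitmproxy's
    # self-signed certificate so the HTTPS traffic can be intercepted:
    curl -x http://localhost:8080 -k https://example.com/

Pointing a browser at the same proxy and repeating the page load lets you compare the two requests in the mitmproxy UI, header by header (User-Agent, cookies, Accept headers), to see what the server might be keying on.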

1 Answer


I believe you're getting rate-limited or blocked by the website. If I run the same curl command from my laptop, I get the webpage back.
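
As a quick way to compare the two machines, you can print just the HTTP status code; example.com again stands in for the redacted URL:

    # Print only the status code, discarding the body: expect 200 from the
    # machine that works and 403 from the blocked server:
    curl -s -o /dev/null -w "%{http_code}\n" https://example.com/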

Remember to respect robots.txt if you're doing web scraping.
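
robots.txt sits at the site root, so checking it is a single request (example.com is a stand-in for the actual site):

    # Fetch the site's crawling policy; a Disallow rule covering the scraped
    # path means the site asks automated clients to stay away:
    curl https://example.com/robots.txt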

– shearn89
  • Did not know about robots.txt, great find, thanks. I had no idea about rate limiting, but I don't think that's the case here, because the very first call after deployment was blocked. – Vytautas Šerėnas Feb 04 '22 at 09:16