
Most site sources can be fetched with a simple request, usually via file_get_contents() or curl_init().

I've tried many combinations of stream_context_create() and curl_setopt(), and none of them returned anything other than 400 Bad Request.
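
For reference, here is a minimal sketch of the kind of request I mean (illustrative, not my exact code; ignore_errors just makes the error response readable):

    <?php
    // Plain GET with no extra headers; against https://phys.org/ a request
    // like this comes back as 400 Bad Request for me.
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'ignore_errors' => true, // fetch the body even on 4xx/5xx
        ],
    ]);
    $html = file_get_contents('https://phys.org/', false, $context);
    echo $http_response_header[0]; // e.g. "HTTP/1.1 400 Bad Request"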

Is there an explanation for why some sites (like https://phys.org/) do not return their source code with the methods quoted above?

Note: if you are able to get the source of the example (https://phys.org/) using file_get_contents(), curl_init(), or any other PHP method, please post the code. Thanks.

1 Answer


Some websites validate whether a request comes from a real/allowed client (bot or user).
This can have multiple reasons.

Maybe bots are sending too many requests, or the specific site is behind a paywall/firewall. But there are many other people who can explain it to you better than me.

Here are some known examples of how they do it:

Some sites support requests with an API token.
The Google APIs are a great example.
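
As an illustration, a minimal sketch of the token pattern (the Google Books endpoint is real, but YOUR_API_KEY is a placeholder and the exact parameters depend on the API you call):

    <?php
    // Many Google APIs accept the key as a "key" query parameter.
    // YOUR_API_KEY is a placeholder; a request with an invalid key is
    // rejected, which is the same idea as blocking unknown clients.
    $url = 'https://www.googleapis.com/books/v1/volumes?q=php&key=YOUR_API_KEY';
    $response = file_get_contents($url);
    $data = json_decode($response, true);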

Some sites validate the User-Agent header.
It looks like your example site is doing this.
When I send a custom User-Agent header, the request no longer returns an error.
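
A minimal sketch of both approaches, assuming a browser-like User-Agent is all that the site checks for (untested beyond that assumption):

    <?php
    $userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'; // browser-like UA

    // Variant 1: file_get_contents() with a stream context
    $context = stream_context_create([
        'http' => [
            'header' => "User-Agent: $userAgent\r\n",
        ],
    ]);
    $html = file_get_contents('https://phys.org/', false, $context);

    // Variant 2: cURL with CURLOPT_USERAGENT
    $ch = curl_init('https://phys.org/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body as string
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // send the UA header
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);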

And of course, some sites can check the user's IP address :)

I believe that for your example there should be a good solution to get a result.

Sysix
  • The example site also has a feed at https://phys.org/rss-feed/, so it's somewhat interesting that other sources manage to get its content. The feed link also returns 400 Bad Request... It's not about avoiding bots or harvesting content. – Marlon Augusto Dec 29 '20 at 13:50