-2

I am new to this subject, so my question may be a basic one; sorry in advance. I am trying to scrape a web page, for example this one: link (google)

I am trying to scrape it with Python. My problem is that when I call requests.get, I don't seem to get the full content of the page. I guess that is because the page has many resources and Python does not fetch them all. (Moreover, when I scroll up with the mouse, more data is revealed in Chrome, yet I can see from the source code that no more data is downloaded to be shown.) How can I get the full content of a web page? What am I missing?

thanks

Uri
  • requests is an *HTTP client*; it fetches a resource from a server based on the URL, by making an HTTP connection. A browser is much, much more than an HTTP client. A browser is built on top of an HTTP client to fetch resources and then render them. That includes parsing HTML, loading referenced resources (CSS, images, scripts), and executing scripts, which can in turn trigger more resources to be loaded, and so on. requests doesn't do any of those things because it is not a browser. – Martijn Pieters Jul 17 '18 at 11:34
  • You either need to analyse what the browser is doing with the resources it receives, or you need to use something that does the same thing as a browser. For the first option, the browser developer tools can help: look at the network tab to see what requests are being sent to the server; perhaps you can just make those directly. For the second, the [`requests-html` library](http://html.python-requests.org/) does some of this, for example. – Martijn Pieters Jul 17 '18 at 11:37
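
To make the HTTP-client-versus-browser point concrete, here is a minimal sketch using only the standard library. It lists the resources that an HTML document merely *references*; a browser would fetch each of them with a separate request, while `requests.get` returns only the HTML text itself. The names `ResourceCollector`, `referenced_resources` and `SAMPLE_HTML` are made up for illustration, and the sample markup is an assumption, not the page from the question.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs of scripts, stylesheets and images referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.resources.append((tag, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.resources.append((tag, attrs["href"]))


def referenced_resources(html):
    """Return (tag, url) pairs for resources the HTML refers to but does not contain."""
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources


# Hypothetical page: the "results" div is empty in the HTML itself and would
# only be filled in after a browser downloads and runs /static/app.js.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body>
<div id="results"></div>
<img src="/static/logo.png">
</body></html>
"""

if __name__ == "__main__":
    for tag, url in referenced_resources(SAMPLE_HTML):
        print(tag, url)
```

In practice you would pass the text returned by `requests.get(url).text` to `referenced_resources`. Anything that a script adds to the page afterwards will not show up in that text at all.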

1 Answer

-1

requests.get will fetch the web page, but only what the server decides to give a robot. If you want the full page as a human sees it, you need to make the request look like it comes from a browser by changing your headers. If you need to scroll or click buttons in order to see the whole page, which is what I think you'll need to do, I suggest you take a look at selenium.
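
A minimal sketch of the Selenium approach, assuming the `selenium` package and a Chrome driver are installed, and assuming the page loads more content as you scroll towards the bottom (the page in the question may behave differently). The helper names `scroll_until_stable` and `fetch_rendered_html` are hypothetical; the calls `execute_script`, `page_source`, `get` and `quit` are part of Selenium's WebDriver API.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method and
    `page_source` attribute. Returns the rendered HTML.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for scripts to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, assume the page is complete
        last_height = new_height
    return driver.page_source


def fetch_rendered_html(url):
    """Open `url` in a real browser and return the HTML after scrolling."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return scroll_until_stable(driver)
    finally:
        driver.quit()
```

Calling `fetch_rendered_html("https://example.com")` would open a Chrome window, scroll, and return the HTML as the browser sees it after scripts have run. The fixed `pause` is a simplification; waiting on a specific element is usually more reliable.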

Pixel
  • While it's true that some sites alter behaviour based on headers and other information (which they are free to do), for the vast majority of websites this is not the case. The real issue is the difference between an HTTP client and a browser. selenium is one way of driving the latter from Python, but not the only way. – Martijn Pieters Jul 17 '18 at 11:39
  • Fair enough, but I'm not sure why the downvote, since the accepted answer of the duplicate question gives the exact same answer I gave, which is to use selenium. – Pixel Jul 17 '18 at 11:43
  • Thanks for your answer. Are you sure my problem is that the content is not returned because I am using a bot? Couldn't the problem be in retrieving information from multiple sources (which could be downloaded at run time)? – Uri Jul 17 '18 at 11:46
  • Yes, it could be, especially on the page that you linked, which is probably a bunch of JavaScript generating the content as you go and trying to update it in real time. Selenium will help you with that. – Pixel Jul 17 '18 at 11:49