Forgive me if I don't use the proper terminology. I have a webpage that I'm trying to scrape information from. The problem is that when I view the page source, the data I want to scrape is not there. I've encountered this before: the main HTTP request triggers other requests, so the information I'm looking for is actually somewhere else, which I find using the Network tab of Google Chrome's Inspect tool. I manually search the various documents and XHR files for the one that has the correct information. This is sometimes long and tedious. I can also use Chrome's Inspect feature to inspect the element that contains the information I want, and that brings up the correct source code, but I can't figure out where or how I can use that to quickly find the corresponding HTTP headers.

In short: can I use the Inspect Element feature of Google Chrome and then ask it to show me the corresponding network event (HTTP request) that produced that code?

I'll add the case study I'm working on.

 http://www.flashscore.com/tennis/atp-singles/acapulco/results/

shows the different matches that took place at a tennis tournament. I'm trying to scrape the match hrefs, but if you view the source of the page you'll see they're not there.

Thanks

Vindictive

2 Answers

In short: can I use the Inspect Element feature of Google Chrome and then ask it to show me the corresponding network event (HTTP request) that produced that code?

No. This isn't something that the browser keeps track of.

In most situations, the HTTP response passes through a good deal of JavaScript code before it is eventually turned into elements on the page. Tracing which HTTP response was "responsible" for a given element would require extensive data flow analysis, which is impractical for a browser to do.
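Although the browser cannot make this link for you, the manual search through responses can be partly automated. The following is a minimal sketch, not a definitive tool. It assumes you have exported the Network tab's traffic from Chrome DevTools as a HAR file with response bodies included, and that the text you see on the page appears verbatim in some response. The function names and the file name in the usage note are hypothetical.

```python
import base64
import json


def load_har(path):
    """Load a HAR export (a JSON document) from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_requests_containing(har, needle):
    """Return the URLs of HAR entries whose response body contains `needle`."""
    matches = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue  # the body was not saved in this export
        if content.get("encoding") == "base64":
            # HAR marks bodies that were stored base64-encoded
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        if needle in text:
            matches.append(entry["request"]["url"])
    return matches
```

For example, `find_requests_containing(load_har("results.har"), "some visible text")` would list the candidate requests. If the site transforms or encodes its data before rendering, the visible text may not appear in any response, and the search will return nothing.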

  • After reading this article: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/ I found that the data is being processed client side, which seems to mean I have to look at the XHR calls. But none of them shows a preview or response containing the matches I see on the webpage. I'm assuming that's some measure to prevent me from scraping the data on the server side? – Vindictive Oct 16 '16 at 19:08
  • That's unlikely to be the case just as a preventative measure. More likely the authors of the site just felt that it'd be easier to have HTML rendering occur on the browser side. –  Oct 16 '16 at 20:04

One way:

Open Firefox, install LiveHttpHeaders, then run it, and you will see the expected headers.

A similar add-on exists for Google Chrome, but I haven't tested it.

Gilles Quénot