I'm trying to scrape a webpage that contains a table of test results using Python and BeautifulSoup. At this point I don't mind if it's just raw HTML / unparsed data.

The table of results is contained within a parent DIV tag with the class 'test-view-grid-area'.

I got the class name of the DIV tag by inspecting the webpage in Chrome, and when viewing the page source it's definitely correct, but when I run the code below, my results come back as:

[<div class="test-view-grid-area"></div>]

So it appears to be finding the tag but not returning its contents? I am not sure what I need to do to get the contents of the DIV returned.

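from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
url = '[url of server / webpage]'
# 'headers' was not defined in the original snippet; this value is a placeholder assumption
headers = {'User-Agent': 'Mozilla/5.0'}
response = http.request('GET', url, headers=headers)
soup = BeautifulSoup(response.data, 'html.parser')
# find_all returns a list of every matching <div>, hence the [...] in the output
grid_data = soup.find_all("div", class_="test-view-grid-area")
print(grid_data)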

Edit: I've gotten a little further. I am now getting the following response directly from the script tag, which contains a JSON string:

[<script class="__allSuitesOfSelectedPlan" defer="defer" type="application/json">
{"selectedOutcome":"","selectedTester":{"displayName" <etc>}</script>]

So now I am trying to figure out how to write a regex search pattern for everything between the {}, run that pattern against my initial data scrape, and then load the JSON string into an object.
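The regex plan described above could be sketched roughly as follows, assuming the scraped page source is already held in a string. The `html` sample (including the "Example User" value) and the `extract_json_object` helper are hypothetical, and the greedy pattern only behaves when the text contains a single JSON object.

```python
import json
import re

# Hypothetical sample standing in for the scraped page source.
html = '''<script class="__allSuitesOfSelectedPlan" defer="defer" type="application/json">
{"selectedOutcome":"","selectedTester":{"displayName":"Example User"}}</script>'''


def extract_json_object(text):
    """Parse the span from the first '{' to the last '}' in text as JSON."""
    # Greedy match with DOTALL so the pattern can cross newlines.
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match is None:
        raise ValueError('no JSON object found')
    return json.loads(match.group(0))


data = extract_json_object(html)
print(data['selectedTester']['displayName'])
```

If the page holds several JSON blobs, the greedy match would span all of them and `json.loads` would fail, so selecting the one script tag first is probably the safer route.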

Dekks
  • 137
  • 12
  • 2
    The page is likely dynamic. You'll need to either a) examine the XHR and maybe get the data returned in JSON format, b) that data in JSON format can be within ` – chitown88 Apr 03 '19 at 12:04
  • Possible duplicate of [web scraping dynamic content with python](https://stackoverflow.com/questions/17608572/web-scraping-dynamic-content-with-python) – ivan_pozdeev Apr 03 '19 at 12:09
  • @chitown88 OK, I have done some more inspection: I scraped the entire page, looked at BS's output, and found that the data I want is in the following tag: – Dekks Apr 03 '19 at 13:16
  • @chitown88 I assume I now need to strip the actual <> tag parts somehow and then use the json module to load it into a dictionary? – Dekks Apr 03 '19 at 13:22
  • 1
    Yes. You'll have to do some string manipulation to shape that into valid JSON. Then you'll use `json.loads(json_str)` to turn that string into a dictionary. Could you share that element so I can show you in a solution? Can you paste it into your original post above? – chitown88 Apr 03 '19 at 13:39
  • @chitown88 Appreciate your continued assistance. I've added a few more details to the original post. – Dekks Apr 03 '19 at 14:51
  • 1
    Can you get that stored as a string, i.e. using `.text` from BeautifulSoup? – chitown88 Apr 03 '19 at 14:56
  • Ah, I didn't realise it was that simple. Doing `data = json.loads(grid_data.text)` worked, and it contains just the {} JSON string. I still need to figure out how to output that in a useful format, but that's more than enough for me to do some further reading and figure that part out on my own, or at least attempt to. Many thanks for your help. If you want to post this as an answer, I'm happy to mark it as accepted. – Dekks Apr 04 '19 at 07:45
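For the remaining "useful format" step, one possible sketch is below. It assumes the script tag's text is already in a string; the `json_str` sample and the `flatten` helper are hypothetical, not part of the original page or of any library. Note that `.text` needs a single tag (for example from `soup.find`, or one element of the `find_all` list), not the list itself.

```python
import json

# Hypothetical payload standing in for the text of the <script> tag.
json_str = '{"selectedOutcome": "", "selectedTester": {"displayName": "Example User"}}'

data = json.loads(json_str)  # JSON string -> dict

# Indented dump, handy for reading the overall structure.
pretty = json.dumps(data, indent=2, sort_keys=True)
print(pretty)


def flatten(obj, prefix=''):
    """Yield (dotted_path, value) pairs for every leaf in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f'{prefix}{key}.')
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f'{prefix}{index}.')
    else:
        # Leaf value: drop the trailing dot from the accumulated path.
        yield prefix.rstrip('.'), obj


rows = dict(flatten(data))
for path, value in rows.items():
    print(f'{path} = {value!r}')
```

The flattened path/value pairs could then be written out as rows of a CSV or fed into a table, depending on what the test results need to look like.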

0 Answers