
I'm trying to scrape https://www.rspca.org.uk/findapet#onSubmitSetHere to get a list of all pets for adoption.

I've built web scrapers before using crawler4j but the websites were static.

Since https://www.rspca.org.uk/findapet#onSubmitSetHere is not a static website, how can I scrape it? Is it possible? What technologies should I use and how?

Update:

When you fill in the search form (Select type of pet and Enter postcode/town or county) in the UI, the results are then displayed below the search box.

[Image: screenshot of the search form and its results]

In the screenshot, the search bar is outlined in red and the results are outlined in black.

I'm trying to scrape the results and also the content of each result.

I've looked at the requests the browser makes to retrieve the results, but from Chrome dev tools it isn't obvious which request fetches them.

breaktop
  • What information are you trying to scrape from https://www.rspca.org.uk/findapet#onSubmitSetHere? – undetected Selenium Feb 19 '22 at 00:16
  • I'm trying to scrape the results and also the content of each result. I've updated the question with an image of the search query and results. The results are the animals available for adoption. – breaktop Feb 19 '22 at 09:58

1 Answer


You could use Selenium to extract information from the DOM once a browser has rendered it, but I think a simpler solution is to use "developer tools" to find the request that the browser makes when the "search" button is clicked, and try to reproduce that.

In this case, clicking "search" makes the browser send a POST to https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search

The body of the POST request contains a lot of parameters, including animalType and location. The content-type of the request is application/x-www-form-urlencoded.
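As a rough illustration, here is a minimal Python sketch of how such a request could be built with only the standard library. The endpoint URL and the parameter names `animalType` and `location` are taken from the request described above. The values `"DOG"` and `"London"` are placeholders I have assumed, and the helper `build_search_request` is a hypothetical name. The real form sends more parameters than these two, so you would likely need to copy the rest from dev tools and pass them in via `extra_params`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint observed in the browser's network tab (see above).
SEARCH_URL = (
    "https://www.rspca.org.uk/findapet"
    "?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets"
    "&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view"
    "&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search"
)


def build_search_request(animal_type, location, extra_params=None):
    """Build (but do not send) a form-encoded POST request for the search.

    Hypothetical helper: only animalType and location are known from the
    answer; any other form fields must be supplied through extra_params.
    """
    params = {"animalType": animal_type, "location": location}
    params.update(extra_params or {})
    # urlencode percent-encodes the values, so the result is plain ASCII.
    body = urlencode(params).encode("ascii")
    return Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; check the real ones in the browser's payload view.
request = build_search_request("DOG", "London")
# To actually send it: urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes it easy to compare the encoded body against what the browser sends before hitting the site.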

To see these parameters, open the "Network" tab in Chrome dev tools, click the "findapet" request (it's the first one in the list when I do this), and open the "Payload" tab. It shows both the query string parameters and the form parameters; animalType and location are among the form parameters.

The response contains HTML.

I would try making a request to that endpoint and then parsing the HTML in the response.
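Parsing the response could look something like the sketch below, which uses Python's built-in `html.parser` to collect every link and its text. I don't know the actual markup of the results page, so `SAMPLE_HTML`, its `result` class and the `/findapet/details/...` paths are all invented for illustration, and `LinkCollector` is a hypothetical name. In practice you would feed in the decoded response body and filter for whatever elements wrap each result.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Invented markup standing in for the real response body.
SAMPLE_HTML = """
<div class="result"><a href="/findapet/details/1">Rex</a></div>
<div class="result"><a href="/findapet/details/2"> Bella </a></div>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for href, text in collector.links:
    print(href, text)
```

Each collected link could then be fetched in turn to get the content of an individual result.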

tgdavies
  • I've looked at the requests the browser makes to retrieve the results, but from Chrome dev tools it isn't obvious which request fetches them, or where in the URL or the payload `animalType` and `location` are placed. – breaktop Feb 19 '22 at 09:56
  • I've added more details above. – tgdavies Feb 19 '22 at 22:18