
Link: https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0

This website serves its questions as images, which I need to scrape. However, I cannot get a link to their source: my script only outputs links to some loading GIFs. When I looked at the page source, the images did not even have a "src" attribute. You can see how the website works at the link above. How can I download all of these images?

from bs4 import BeautifulSoup
import requests
import os

url = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

images = soup.find_all('img')

for image in images:
    # .get() returns None instead of raising KeyError when an <img> has no "src"
    link = image.get('src')
    print(link)
  • You need to use `selenium` to navigate to the site and click on questions. Once you click on a question the image link will appear in the source. – Chris Dec 16 '20 at 16:09

2 Answers


Because the page is rendered dynamically, BeautifulSoup on its own doesn't work here. You have to use Selenium:

  1. Navigate to the site
  2. Get all question links using the xpath //div/div[3]/center/table/tbody/tr/td[1]/center/a, then loop over them and click each one.
  3. After each click, get the image source using the xpath //*[@id="question_prev"]/div[2]/img/@src, then download and save the image.
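
The three steps above might look roughly like this in Python. This is only a sketch and has not been run against the live site: it assumes Selenium 4 with Chrome and a matching driver available locally, and the names `collect_image_sources` and `filename_from_src` are made up for illustration. Selenium's `find_element` has to return an element rather than an attribute, so the trailing `/@src` is dropped from the XPath and the value is read with `get_attribute("src")` instead.

```python
import os
from urllib.parse import urlparse

# XPaths taken from the steps above; "/@src" is dropped from the second one
# because Selenium locates elements, not attributes.
QUESTION_LINKS_XPATH = "//div/div[3]/center/table/tbody/tr/td[1]/center/a"
QUESTION_IMAGE_XPATH = '//*[@id="question_prev"]/div[2]/img'


def filename_from_src(src):
    """Return the file name at the end of an image URL (hypothetical helper)."""
    return os.path.basename(urlparse(src).path)


def collect_image_sources(url, timeout=10):
    """Click every question link on the page and return the image URLs seen."""
    # Imported lazily so filename_from_src works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    sources = []
    try:
        driver.get(url)
        # Assumption: clicking a question does not re-render the list of links.
        for link in driver.find_elements(By.XPATH, QUESTION_LINKS_XPATH):
            previous = sources[-1] if sources else None
            link.click()

            def changed_src(drv, previous=previous):
                # Falsy until the preview image has a src different from the last one.
                src = drv.find_element(By.XPATH, QUESTION_IMAGE_XPATH).get_attribute("src")
                return src if src and src != previous else False

            sources.append(WebDriverWait(driver, timeout).until(changed_src))
    finally:
        driver.quit()
    return sources
```

Each returned URL can then be fetched with an ordinary HTTP request and written to disk under `filename_from_src(src)`. If the page shows a placeholder image while a question loads, the wait condition would need to be stricter than "the src changed".
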
Harish Vutukuri

The question IDs are embedded in the page itself, in the `onclick` attribute of each question link. Try extracting them with the `re` (regex) module.

import re
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
BASE_URL = "https://www.exam-mate.com"

# Pass the headers defined above so the request carries the browser user-agent
soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")

for tag in soup.select("td:nth-of-type(1) a"):
    # Extract the question image path from the link's onclick attribute
    question_link = re.search(r"/questions.*\.png", tag["onclick"]).group()
    print(BASE_URL + question_link)

Output:

https://www.exam-mate.com/questions/1240/1362/1240_q_1362_1_1.png
https://www.exam-mate.com/questions/1240/1363/1240_q_1363_2_1.png
https://www.exam-mate.com/questions/1240/1364/1240_q_1364_3_1.png
https://www.exam-mate.com/questions/1240/1365/1240_q_1365_4_1.png
https://www.exam-mate.com/questions/1240/1366/1240_q_1366_5_1.png
... and so on
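
The snippet above only prints the image URLs. To actually save the files, something along these lines could work. It is a sketch under assumptions: the helper names are invented for illustration, it uses the standard library's `urllib.request` rather than `requests` so that it has no extra dependencies, and the `User-Agent` value is a placeholder.

```python
import os
import re
import urllib.request

BASE_URL = "https://www.exam-mate.com"
IMAGE_PATH_RE = re.compile(r"/questions.*\.png")  # same pattern as in the answer
USER_AGENT = "Mozilla/5.0"  # placeholder; use a full browser string if needed


def image_url_from_onclick(onclick):
    """Return the absolute image URL found in an onclick string, or None."""
    match = IMAGE_PATH_RE.search(onclick or "")
    return BASE_URL + match.group() if match else None


def local_path_for(url, out_dir):
    """Map an image URL to a file path inside out_dir, keeping the file name."""
    return os.path.join(out_dir, url.rsplit("/", 1)[-1])


def download_images(urls, out_dir="questions"):
    """Download every URL into out_dir and return the list of saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=30) as response:
            data = response.read()
        path = local_path_for(url, out_dir)
        with open(path, "wb") as fh:
            fh.write(data)
        saved.append(path)
    return saved
```

For example, `download_images(["https://www.exam-mate.com/questions/1240/1362/1240_q_1362_1_1.png"])` would write `questions/1240_q_1362_1_1.png`, assuming the server allows the request. Returning `None` from `image_url_from_onclick` lets the caller skip links whose `onclick` does not contain an image path instead of failing on them.
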
MendelG
  • Sorry to bother you, but why does it not work on this link? https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&chapter=&paper=&unit=&zone=&level=&order=asc&offset=2000 The code doesn't seem to work on pages beyond this one. – Piyush Gehlot Dec 17 '20 at 03:08
  • @PiyushGehlot Well, I don't see any images on the new link that you have provided. – MendelG Dec 17 '20 at 03:56
  • This page is almost the same as the first one. I don't know why you can't see the images. Where can you not see them: on the webpage or in the source code? The source code seems to be the same. By the way, thank you so much for replying. – Piyush Gehlot Dec 17 '20 at 04:17
  • @PiyushGehlot Since the page source contains no images, none are rendered on the page either. [See a screenshot](https://i.stack.imgur.com/GIHMR.png); I don't see any images on the page. – MendelG Dec 17 '20 at 05:27
  • Oh, these questions are only available with a subscription, and I have one (which is why they are visible to me). So how do I get these now? Should I do something different? – Piyush Gehlot Dec 17 '20 at 05:41
  • There's no way for me to answer without seeing the source code of the page. – MendelG Dec 17 '20 at 05:43
  • Can I give you the ID and password? I'm OK with that. – Piyush Gehlot Dec 17 '20 at 05:55
  • @PiyushGehlot Sorry, I won't be able to help you with that. _Note:_ Stack Overflow is a Q&A site. – MendelG Dec 17 '20 at 05:59
  • Oh, no problem. Thank you so much for guiding me, though. – Piyush Gehlot Dec 17 '20 at 06:03