
I am trying to get URLs from a search on the website semana.com. After carefully replicating the approaches from these previous searches (1) (2) (3), I have come up empty. The website uses Queryly to manage its search, which prevents one from scraping the results out of HTML (or JavaScript) attributes.

Taking the request parameters from a link inspected with Chrome's DevTools on this search page (https://www.semana.com/buscador/?query=refugiado), I run the following code:

import pandas as pd
import numpy as np
import csv
import json
import requests
import lxml
from tqdm import tqdm
from bs4 import BeautifulSoup, SoupStrainer

params = {
    "queryly_key": "06e63be824464567",
    "query": "refugiado",
    "endindex": "0",
    "batchsize": "20",
    "callback": "searchPage.resultcallback",
    "showfaceted": "true",
    "extendeddatafields": "creator,imageresizer,promo_image",
    "timezoneoffset": "360"
}

#goal = ["", ""]

def main(url):
    with requests.Session() as req:
        r = req.post(url, params=params)
        r = str(r)
        for page, item in enumerate(range(0, 20, 20)):
            print(f"Extracting Page# {page +1}")
            params["endindex"] = item
            for loop in r:
                soup = BeautifulSoup(loop)
                target = soup.find("div", class_ = "queryly_item_row")
                #url = target.a.get("href")
                print(soup)

main("https://api.queryly.com/json.aspx")

While it runs, it only returns None results. Ideally, it would return a list of URLs that I can then scrape. I am open to any solutions, and I appreciate your help in advance. I am more familiar with R, but I have been using Python for web scraping recently.
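From the DevTools request, the callback=searchPage.resultcallback parameter suggests the endpoint returns JSONP (the JSON payload wrapped in a function call) rather than HTML, which may be why BeautifulSoup finds nothing. Here is a minimal sketch of how I would unwrap such a response and pull out links if I could get one — the wrapper name and the "items"/"link" field names are guesses modeled on the request above, not confirmed against the live API:

```python
import json
import re

def unwrap_jsonp(text):
    # Strip a JSONP wrapper like searchPage.resultcallback({...}) down to the JSON inside.
    match = re.search(r"^[\w.]+\((.*)\)\s*;?\s*$", text, re.S)
    return json.loads(match.group(1) if match else text)

# Hypothetical sample response -- the "items"/"link" field names are my guess.
sample = ('searchPage.resultcallback({"items": ['
          '{"link": "https://www.semana.com/a"}, '
          '{"link": "https://www.semana.com/b"}]})')

data = unwrap_jsonp(sample)
urls = [item["link"] for item in data["items"]]
print(urls)  # → ['https://www.semana.com/a', 'https://www.semana.com/b']
```

Dropping the callback parameter from the request might also make the endpoint return plain JSON, which would skip the unwrapping step entirely.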

Comment: I have also tried using Docker in combination with scrapy-splash and rendering HTML sessions in requests-HTML, to no avail. – tripple Aug 31 '23 at 05:53

0 Answers