
To practice, I am trying to scrape the following website, which displays data across multiple pages. Unfortunately, I keep getting a 500 Internal Server Error for every page when I try to fetch the data, which in this case is contained in the td tags.

Here's my attempt so far. The URLs are built correctly, but every request returns a 500 error, so I cannot structure the data and build a dataframe containing the td tags from each page. Any ideas on how to solve this? Thanks!

import requests
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
from random import randint

url = "https://www.linguasport.com/futbol/nacional/liga/seekff_esp.asp?pn={}"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}

dfs = []
for page in range(1, 20):
    sleep(randint(1, 5))  # random delay between requests
    soup = BeautifulSoup(requests.get(url.format(page), headers=headers).content, "html.parser")
    tds = soup.find_all("td")  # this is the data I want to collect from each page
console.log
  • Quite odd! I tried replicating it with `rvest` in `R` and I get the same error (but oddly, only for URLs with page numbers) – Mark Jul 12 '23 at 05:32
  • Does this answer your question? [Error 500 Web Request Can't Scrape WebSite](https://stackoverflow.com/questions/27761069/error-500-web-request-cant-scrape-website) – Mark Jul 12 '23 at 05:33

1 Answer


The website you mentioned checks for cookies, and it only sets valid cookies in response to requests without a query string. So you can first request the base URL to obtain the cookies, then send the subsequent page requests with those cookies retained.

import requests
import time

base_url = "https://www.linguasport.com/futbol/nacional/liga/seekff_esp.asp"

first = True
with requests.Session() as session:  # the session stores and resends cookies automatically
    for pg_num in range(1, 5):
        if first:
            # the first request, without a query string, sets the validation cookies
            res = session.get(base_url)
            first = False
        else:
            # later requests reuse the session's cookies, so the paged URLs work
            res = session.get(f'{base_url}?pn={pg_num}')
        print(res.status_code)  # 200
        time.sleep(3)  # be polite: delay between requests

It's also better to add some delay between requests. It worked without custom headers, so I didn't include them.
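To tie this back to the goal in the question, here is a minimal sketch that combines the cookie-first session with the td parsing and dataframe building. Note that collecting the raw td texts into a one-column frame per page is an assumption about the desired output; the real column structure of the table isn't known here, so adapt the row-building step to the actual layout.

import requests
import time
import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://www.linguasport.com/futbol/nacional/liga/seekff_esp.asp"

pages = []
with requests.Session() as session:
    session.get(base_url)  # first request (no query string) obtains the cookies
    for pg_num in range(1, 5):
        res = session.get(f'{base_url}?pn={pg_num}')  # cookies are resent automatically
        soup = BeautifulSoup(res.content, "html.parser")
        # collect the text of every td on the page; the real table structure
        # (column count, header rows) is an assumption to adapt as needed
        texts = [td.get_text(strip=True) for td in soup.find_all("td")]
        pages.append(pd.DataFrame({"page": pg_num, "td_text": texts}))
        time.sleep(3)  # delay between requests

df = pd.concat(pages, ignore_index=True)
print(df.head())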

Reyot
  • Thanks for your answer. This works to get a valid response, but as soon as I try to grab the data after the second page, I get a 500 Internal Server Error. I can share my code below. – console.log Jul 12 '23 at 11:34
  • I think I didn't explain properly. You need to use session.get() instead of requests.get(), not both together. If you use requests.get(), it won't send cookies. And I used `res` as short for `response` – Reyot Jul 12 '23 at 11:39
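To illustrate the point in that last comment: a requests.Session carries its cookie jar across calls, while a bare requests.get() opens a fresh connection with no cookies. A minimal sketch of the difference (the status codes in the comments reflect what this thread reports, not guaranteed behavior):

import requests

base_url = "https://www.linguasport.com/futbol/nacional/liga/seekff_esp.asp"

with requests.Session() as session:
    session.get(base_url)                   # sets the cookies on the session
    ok = session.get(f'{base_url}?pn=2')    # cookies sent automatically -> 200
    bad = requests.get(f'{base_url}?pn=2')  # separate request, no cookies -> 500
    print(ok.status_code, bad.status_code)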