I'm trying to write a script given the following instruction:
Scrape all the URLs in a sitemap index and store the information in an Excel file, specifying for each URL the corresponding status code.
I managed to scrape all the URLs and store them in a file, but I'm still struggling to find a way to get the status codes.
import requests
from bs4 import BeautifulSoup
url = 'https://www.example-domain.it/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))
urls = [element.text for element in sitemap_index.findAll('loc')]
print(urls)
def extract_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.findAll('loc')]
    return links
sitemap_urls = []
for url in urls:
    links = extract_links(url)
    r = requests.get(url)
    code = r.status_code
    sitemap_urls += links
    sitemap_urls += code
print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))
import pandas as pd
df = pd.DataFrame(sitemap_urls)
df.to_excel("Export_link.xlsx")
Can anyone please help me fix this script?
I get this error:
TypeError: 'int' object is not iterable
The problem is in the following lines:
sitemap_urls = []
for url in urls:
    links = extract_links(url)
    r = requests.get(url)
    code = r.status_code
    sitemap_urls += links
    sitemap_urls += code
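As far as I can tell, the error can be reproduced without any network calls, which suggests the failing statement is the last one: `+=` on a list expects an iterable, and `r.status_code` is a plain int. A minimal sketch, where the URL and the value 200 are placeholders I made up for illustration:

```python
# Placeholder data: the URL and status value are illustrative assumptions.
sitemap_urls = ['https://www.example-domain.it/page-1']
code = 200  # stand-in for r.status_code, which is an int

try:
    # list += x extends the list with the items of x, so x must be iterable
    sitemap_urls += code
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'int' object is not iterable
```

The list is left unchanged when the statement fails.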
If I just write:
sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links
the script correctly exports all the URLs found in the sitemap index, but I need to go a step further: for each URL contained in the sitemaps, I'd like to get its status code.
How can I arrange this iteration to make it happen?
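To make the goal concrete, this is roughly the shape of result I'm after: one row per URL, with the URL and its status code kept together instead of being mixed into one flat list. It is only a sketch under my own assumptions, not a tested solution. The helper names `fetch_status` and `collect_status_codes`, the 10-second timeout, and the `None` value for failed requests are all hypothetical choices of mine.

```python
def fetch_status(url):
    """Return the HTTP status code for url, or None if the request fails."""
    # Imported here so collect_status_codes can be tried out without requests.
    import requests
    try:
        # requests follows redirects by default, so this is the final status.
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None


def collect_status_codes(urls, get_status=fetch_status):
    """Build one {'url', 'status_code'} row per URL (hypothetical helper)."""
    rows = []
    for url in urls:
        rows.append({'url': url, 'status_code': get_status(url)})
    return rows


# Intended usage with the list of page URLs built above (not run here):
# rows = collect_status_codes(sitemap_urls)
# pd.DataFrame(rows).to_excel("Export_link.xlsx", index=False)
```

Passing the status lookup in as a parameter is just so the loop can be checked with fake data before making a request for every URL in the sitemap.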