Scraping sitemap index URLs for status code with Beautiful Soup

Question

I'm trying to write a script given the following instruction:

Scrape all the URLs in a sitemap index and store the information in an Excel file, specifying for each URL the corresponding status code.

I managed to scrape all the URLs and store them in a file, but I'm still struggling to find a way to get the status codes.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example-domain.it/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))
urls = [element.text for element in sitemap_index.findAll('loc')]
print(urls)


def extract_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.findAll('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    r = requests.get(url)
    code = r.status_code
    sitemap_urls += links
    sitemap_urls += code

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

import pandas as pd
df = pd.DataFrame(sitemap_urls)
df.to_excel("Export_link.xlsx")

Can anyone please help me to fix this script?

I get this error:

TypeError: 'int' object is not iterable

The problem is in the following lines:

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    r = requests.get(url)
    code = r.status_code
    sitemap_urls += links
    sitemap_urls += code

If I just write:

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

The script correctly exports all the URLs inside the sitemap_index, but I need to go a step further: for each URL contained in the sitemaps, I'd like to get the corresponding status code.

How can I arrange this iteration to make it happen?

Hi Marco! Yes, I get the following error: TypeError: 'int' object is not iterable — Alessandra Mosconi Romitelli, May 13 '20 at 07:05
With the full output I would be sure, but you probably have an error here: urls = [element.text for element in sitemap_index.findAll('loc')]. It seems it does not contain what you expect; from the error, it looks like urls contains a number and not the list of URLs you expect — Marco, May 13 '20 at 07:08
Welcome to Stack Overflow! Please [edit] your post to include any additional information you have to your question. Avoid adding this in the comments, as they are harder to read and can be deleted easier. The edit button for your post is just below the post's tags. — rizerphe, May 13 '20 at 07:29

αԋɱҽԃ αмєяιcαη · Accepted Answer · 2020-05-13T13:31:30.813


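import csv

import requests
from bs4 import BeautifulSoup


def main(url):
    # Reuse one session so connections are shared across requests.
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Every <loc> in the sitemap index points to a child sitemap.
        links = [item.text for item in soup.select("loc")]
        # newline='' stops the csv module from writing blank rows on Windows.
        with open("data.csv", 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(["Url", "Status Code"])
            for link in links:
                # Fetch each child sitemap and record its own status code.
                r = req.get(link)
                print(link, r.status_code)
                writer.writerow([link, r.status_code])
                soup = BeautifulSoup(r.content, 'html.parser')
                end = [item.text for item in soup.select("loc")]
                for a in end:
                    # HEAD is enough for a page URL: only the status is needed.
                    r = req.head(a)
                    print(a, r.status_code)
                    writer.writerow([a, r.status_code])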


main("https://www.nemora.it/sitemap_index.xml")
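The question asked for an Excel file rather than CSV. A minimal sketch of that variant, assuming pandas and an Excel writer such as openpyxl are installed; `collect_status_rows`, `export_to_excel` and the `get_status` parameter are hypothetical names, not part of the code above:

```python
def collect_status_rows(urls, get_status):
    """Return one (url, status_code) row per URL.

    get_status is a hypothetical callable that maps a URL to an integer
    status code; it is injected so the loop can be tried without a network.
    """
    rows = []
    for url in urls:
        # append() stores the pair as a single row. `rows += code` fails with
        # "TypeError: 'int' object is not iterable" because += on a list
        # expects an iterable to extend from.
        rows.append((url, get_status(url)))
    return rows


def export_to_excel(rows, path="Export_link.xlsx"):
    # Assumption: pandas and an Excel engine (e.g. openpyxl) are available.
    import pandas as pd

    df = pd.DataFrame(rows, columns=["Url", "Status Code"])
    df.to_excel(path, index=False)
```

With requests, this could be called as `export_to_excel(collect_status_rows(urls, lambda u: requests.head(u, allow_redirects=True).status_code))`. Note that `requests.head` does not follow redirects unless `allow_redirects=True` is passed, so without it a redirecting page is reported as 301 or 302 rather than with its final status.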

edited May 13 '20 at 13:31

answered May 13 '20 at 10:09


Great! This works fine for a single sitemap, but in my case I'm trying to iterate through all the URLs listed in a sitemap index (example: https://www.nemora.it/sitemap_index.xml). – Alessandra Mosconi Romitelli May 13 '20 at 10:58
@AlessandraMosconiRomitelli So your question was unclear from the start; at this rate we will just move from one issue to another. So, are you looking to deal with that specific site only? – αԋɱҽԃ αмєяιcαη May 13 '20 at 12:07
No, I was talking about a sitemap index (which is a collection of sitemaps) from the start; that was just an example. – Alessandra Mosconi Romitelli May 13 '20 at 12:23
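Since a sitemap index is a collection of sitemaps, one way to handle either kind of file is to check the root element and recurse into child sitemaps. A rough sketch using only the standard library (not Beautiful Soup as in the answer); `parse_sitemap`, `collect_page_urls` and the `fetch` callable are names made up for illustration, and the namespace is the one published by sitemaps.org:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_doc):
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_doc)
    # The root tag looks like '{namespace}sitemapindex' or '{namespace}urlset'.
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    locs = [el.text.strip() for el in root.iterfind(".//sm:loc", NS) if el.text]
    return kind, locs


def collect_page_urls(url, fetch):
    """Expand a sitemap (or sitemap index) into the page URLs it leads to.

    fetch is a hypothetical callable returning the XML document for a URL,
    for example lambda u: requests.get(u).content
    """
    kind, locs = parse_sitemap(fetch(url))
    if kind == "urlset":
        return locs
    pages = []
    for child in locs:
        # Each <loc> of an index is itself a sitemap, so recurse into it.
        pages.extend(collect_page_urls(child, fetch))
    return pages
```

The resulting list of page URLs can then be checked one by one for its status code.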
