
I'm trying to write a script given the following instruction:

Scrape all the URLs in a sitemap index and store the information in an Excel file, specifying the corresponding status code for each URL.

I managed to scrape all the URLs and store them in a file, but I'm still struggling to find a way to get the status codes.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example-domain.it/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))
urls = [element.text for element in sitemap_index.findAll('loc')]
print(urls)
def extract_links(url):

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.findAll('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    r = requests.get(url)
    code = r.status_code
    sitemap_urls += links
    sitemap_urls += code

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

import pandas as pd
df = pd.DataFrame(sitemap_urls)
df.to_excel("Export_link.xlsx")

Can anyone please help me to fix this script?

I get this error:

TypeError: 'int' object is not iterable

The problem is in the following lines:

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    r = requests.get(url)
    code = r.status_code
    sitemap_urls += links
    sitemap_urls += code

If I just write:

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

the script correctly exports all the URLs inside the sitemap index, but I need to go a step further: for each URL contained in the sitemaps, I'd like to get its status code.

How can I arrange this iteration to make it happen?

  • Are you getting any error or any output when you run this ? – Marco May 13 '20 at 07:03
  • Hi Marco! Yes, I get the following error: TypeError: 'int' object is not iterable – Alessandra Mosconi Romitelli May 13 '20 at 07:05
  • With the full output I would be sure, but you probably have an error here: urls = [element.text for element in sitemap_index.findAll('loc')] .. because it seems it does not contain what you expect. From the error, it looks like urls contains a number and not the list of URLs you expect – Marco May 13 '20 at 07:08
  • Welcome to Stack Overflow! Please [edit] your post to include any additional information you have to your question. Avoid adding this in the comments, as they are harder to read and can be deleted easier. The edit button for your post is just below the post's tags. – rizerphe May 13 '20 at 07:29
  • I edited the post with the information you asked for :) – Alessandra Mosconi Romitelli May 13 '20 at 08:32

1 Answer

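import csv

import requests
from bs4 import BeautifulSoup


def main(url):
    # Reuse one session so the connection is shared across all requests.
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Each <loc> in the sitemap index points to a child sitemap.
        links = [item.text for item in soup.select("loc")]
        # newline='' stops the csv module from writing blank rows on Windows.
        with open("data.csv", 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(["Url", "Status Code"])
            for link in links:
                # GET the child sitemap: its body is needed to read the page URLs.
                r = req.get(link)
                print(link, r.status_code)
                writer.writerow([link, r.status_code])
                soup = BeautifulSoup(r.content, 'html.parser')
                end = [item.text for item in soup.select("loc")]
                for a in end:
                    # HEAD is enough for pages: only the status code is needed.
                    r = req.head(a)
                    print(a, r.status_code)
                    writer.writerow([a, r.status_code])


main("https://www.nemora.it/sitemap_index.xml")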
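
For context, the TypeError in the question comes from `sitemap_urls += code`: `+=` on a list extends it with the items of an iterable, and `code` is a single int. Appending one `(url, status_code)` pair per URL avoids this and gives pandas two clean columns. Since the question asks for an Excel file rather than a CSV, here is a rough sketch of that variant. The function names `extract_locs`, `collect_status_rows` and `rows_to_frame` are my own, and the 10-second timeout is an arbitrary assumption.

```python
import pandas as pd
import requests  # needed by the caller to create the Session
from bs4 import BeautifulSoup


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    soup = BeautifulSoup(xml_bytes, "html.parser")
    return [loc.text.strip() for loc in soup.find_all("loc")]


def collect_status_rows(index_url, session):
    """Walk a sitemap index and return one (url, status_code) tuple per page URL."""
    rows = []
    index = session.get(index_url, timeout=10)
    for sitemap_url in extract_locs(index.content):
        # The child sitemap body is needed, so this one has to be a GET.
        sitemap = session.get(sitemap_url, timeout=10)
        for page_url in extract_locs(sitemap.content):
            # requests does not follow redirects for HEAD by default,
            # so a 301/302 is recorded as is.
            response = session.head(page_url, timeout=10)
            # append() adds one tuple; "+=" with an int is what raised the TypeError.
            rows.append((page_url, response.status_code))
    return rows


def rows_to_frame(rows):
    """Build a two-column DataFrame from (url, status_code) tuples."""
    return pd.DataFrame(rows, columns=["Url", "Status Code"])
```

To use it, open a session with `requests.Session()`, pass it to `collect_status_rows` together with the sitemap index URL, and call `rows_to_frame(rows).to_excel("Export_link.xlsx", index=False)` on the result. Writing `.xlsx` files from pandas needs an Excel engine such as openpyxl to be installed.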