1

I want to know how to find the sitemap of each domain and subdomain using Python. Some examples:

abcd.com/sitemap.xml
abcd.com/sitemap.html
sub.abcd.com/sitemap.xml

And so on.

What are the most probable sitemap names, locations, and extensions?

3 Answers

1

Please take a look at the robots.txt file first. That's what I usually do.

Some domains offer more than one sitemap, and there are cases with more than 200 XML files.

Please remember that, according to the FAQ on sitemaps.org, a sitemap file can be gzipped. Consequently, you might want to consider sitemap.xml.gz too!
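
For example, a minimal sketch of that approach (the helper name find_sitemaps, the https URL scheme, and the fallback names are my own assumptions, not part of the answer):

import requests

def find_sitemaps(domain):
    # Look for Sitemap: lines in robots.txt, then fall back to the default names.
    base = f"https://{domain}"
    try:
        robots = requests.get(f"{base}/robots.txt", timeout=10)
        if robots.status_code == 200:
            found = [line.split(":", 1)[1].strip()
                     for line in robots.text.splitlines()
                     if line.lower().startswith("sitemap:")]
            if found:
                return found
    except requests.RequestException:
        pass
    # No robots.txt or no Sitemap: entries; try the common defaults, gzipped included.
    candidates = []
    for name in ("sitemap.xml", "sitemap.xml.gz"):
        try:
            response = requests.get(f"{base}/{name}", timeout=10)
        except requests.RequestException:
            continue
        if response.status_code == 200:
            candidates.append(response.url)
    return candidates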

Klaus-Dieter Warzecha
  • I want to know all possible names and extensions. Based on what you said, it might be .xml.gz, and I know some sitemaps use .html and .xml extensions. Also, many websites have no robots.txt, or their robots.txt has no Sitemap entry. What do you think about those situations? What I need is a list of the most commonly used sitemap names, extensions, and locations. – William Johnson Oct 27 '19 at 14:24
  • Any robots.txt parser, such as https://github.com/scrapy/protego, should be able to extract the sitemap URLs from a given `robots.txt` file (a minimal sketch follows these comments). If a website does not have a `robots.txt` file, either it has no sitemap or its sitemap may be outdated, in which case you are better off ignoring the sitemap. – Gallaecio Oct 29 '19 at 11:11
  • Thank you @Gallaecio for your comment. What do you think about trying to predict the sitemap name and location in order to find sitemaps? I use crawling, but I think that if I find the sitemap I don't need to crawl the entire website. What is your opinion? I want to find a way to discover every page through a good combination of my methods. – William Johnson Oct 29 '19 at 20:55
  • “What do you think about trying to predict the sitemap name and location in order to find sitemaps?” As I said, if a sitemap is not in `robots.txt`, it’s likely outdated (i.e. the website decided to stop generating a sitemap, but did not bother to remove it). It is up to you whether an outdated sitemap is better than no sitemap for your use case. – Gallaecio Oct 30 '19 at 08:57
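
Following up on the protego suggestion above, a minimal sketch of that approach (the URL is a placeholder):

import requests
from protego import Protego

# Placeholder domain; replace with the site you are crawling.
robots_txt = requests.get("https://example.com/robots.txt", timeout=10).text

rp = Protego.parse(robots_txt)
print(list(rp.sitemaps))  # sitemap URLs declared in robots.txt, if any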
1

I've used a small function to find sitemaps by trying the most common names.

Sitemap naming stats: https://dret.typepad.com/dretblog/2009/02/sitemap-names.html

import requests

def get_sitemap_bruto_force(website):
    # Common sitemap paths, ordered roughly by how often they occur.
    potential_sitemaps = [
        "sitemap.xml",
        "feeds/posts/default?orderby=updated",
        "sitemap.xml.gz",
        "sitemap_index.xml",
        "s2/sitemaps/profiles-sitemap.xml",
        "sitemap.php",
        "sitemap_index.xml.gz",
        "vb/sitemap_index.xml.gz",
        "sitemapindex.xml",
        "sitemap.gz"
    ]

    for sitemap in potential_sitemaps:
        try:
            sitemap_response = requests.get(f"{website}/{sitemap}", timeout=10)
        except requests.RequestException:
            continue
        # Return the first candidate that answers with 200 OK.
        if sitemap_response.status_code == 200:
            return [sitemap_response.url]
    return []
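
Usage would then look something like this (example.com is a placeholder):

sitemaps = get_sitemap_bruto_force("https://example.com")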

Once I retrieve the sitemap index, I send it to a recursive function that finds all links from all sitemaps.

import itertools
import re

def dig_up_all_sitemaps(website):
    sitemaps = []
    # Top-level sitemap (index) URLs for the domain.
    index_sitemap = get_sitemap_paths_for_domain(website)

    def recursive(sitemaps_to_crawl=index_sitemap):
        current_sitemaps = []

        for sitemap in sitemaps_to_crawl:
            try:
                child_sitemap = get_sitemap_links(sitemap)
                # Keep only links that point to further sitemap files.
                current_sitemaps.append(
                    [x for x in child_sitemap if re.search(r"\.xml|\.xml\.gz|\.gz$", x)]
                )
            except Exception:
                continue
        current_sitemaps = list(itertools.chain.from_iterable(current_sitemaps))
        sitemaps.extend(current_sitemaps)
        # Stop recursing once a level yields no further sitemap files.
        if len(current_sitemaps) == 0:
            return sitemaps
        return recursive(current_sitemaps)
    return recursive()

get_sitemap_paths_for_domain returns a list of sitemap URLs.
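
get_sitemap_links is not shown in the answer; a minimal sketch of what such a helper might look like (assuming plain, non-gzipped XML sitemaps):

import requests
import xml.etree.ElementTree as ET

def get_sitemap_links(sitemap_url):
    # Fetch a sitemap (or sitemap index) and return every <loc> value it contains.
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Sitemap elements are namespaced, so match any tag ending in "loc".
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]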

Tajs
0

You should try using urllib's robotparser:

import urllib.robotparser

# Placeholder: use the robots.txt URL of the target domain.
robots = "https://brandurl.com/robots.txt"

rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots)
rp.read()
# site_maps() is available since Python 3.8; it returns None if no Sitemap lines exist.
print(rp.site_maps())

This will give you all the sitemaps in the robots.txt

Most sites have their sitemaps listed there.

  • Does anyone have any solution to find sitemaps when they are not present in robots.txt? – Tajs Jan 03 '23 at 12:57