
I have an automatically generated sitemap for a large website that contains a number of URLs causing 404 errors, which I need to remove. I need to generate a report based only on the URLs that are in the sitemap, not on crawl errors caused by bad links on the site. I cannot see any way of filtering the crawl error reports to include only these URLs. Does anyone know of a way that I can achieve this?

Thanks

TGuimond

3 Answers


I'm not sure you can do it easily from Webmaster Tools, but it is trivial to check them all yourself. Here is a Perl program that accepts a sitemap file, checks each URL it finds, and prints each URL along with its HTTP status:

#!/usr/bin/perl
use strict;
use warnings;
require LWP::UserAgent;

my $ua = LWP::UserAgent->new;
while (my $line = <>){
    # Pull the URL out of each <loc> element in the sitemap.
    if ($line =~ /<loc>(.*?)<\/loc>/){
        my $url = $1;
        # Fetch the URL and keep only the numeric part of the status line.
        my $response = $ua->get($url);
        my $status = $response->status_line;
        $status =~ s/ .*//;
        print "$status $url\n";
    }
}

I save it as checksitemap.pl and use it like this:

$ /tmp/checksitemap.pl /tmp/sitemap.xml 
200 http://example.com/
404 http://example.com/notfound.html
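
If you only want the broken sitemap URLs in the report, you could pipe the output through grep to keep just the 404 lines, for example:

$ /tmp/checksitemap.pl /tmp/sitemap.xml | grep '^404'
404 http://example.com/notfound.html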
Stephen Ostermiller

Nothing natively within WMT. You'll want to do some Excel (or script the same cross-reference; see the sketch after these steps).

  1. Download the list of broken links from the crawl errors report.
  2. Get your list of sitemap URLs.
  3. Put them side by side in two columns.
  4. Use a VLOOKUP to match the columns (http://www.techonthenet.com/excel/formulas/vlookup.php).
  5. As a bonus, use some conditional formatting to make it easier to see whether they match, then sort by colour.
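
As a scripted alternative to the Excel steps, here is a minimal Perl sketch. It assumes you have exported the crawl error URLs to a plain text file with one URL per line; the file names and that format are placeholders, not something WMT produces directly.

#!/usr/bin/perl
# Minimal sketch: print only the sitemap URLs that also appear in the
# crawl error export. The one-URL-per-line error file is an assumption.
use strict;
use warnings;

my ($sitemap_file, $errors_file) = @ARGV;

# Load the crawl error URLs into a hash for quick lookups.
my %errors;
open(my $efh, '<', $errors_file) or die "Cannot open $errors_file: $!";
while (my $line = <$efh>) {
    chomp $line;
    $errors{$line} = 1 if $line;
}
close $efh;

# Walk the sitemap and report every <loc> URL that is in the error list.
open(my $sfh, '<', $sitemap_file) or die "Cannot open $sitemap_file: $!";
while (my $line = <$sfh>) {
    while ($line =~ /<loc>(.*?)<\/loc>/g) {
        print "$1\n" if $errors{$1};
    }
}
close $sfh;

Save it under any name you like (matchsitemaperrors.pl here is arbitrary) and run it like this:

$ perl matchsitemaperrors.pl /tmp/sitemap.xml /tmp/crawl-errors.txt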
Bob C

You can also import the sitemap.xml into A1 Website Analyzer and let it scan the URLs. See: http://www.microsystools.com/products/website-analyzer/help/crawl-website-pages-list/

After that, you can filter the scan results by, e.g., 404 response code and export them to CSV if need be (including, if you want, where they are linked from).

Tom