Can I use WGET to generate a sitemap of a website given its URL?

Question

I need a script that can spider a website and return the list of all crawled pages in plain text or a similar format, which I will submit to search engines as a sitemap. Can I use WGET to generate a sitemap of a website? Or is there a PHP script that can do the same?

score 43 · Accepted Answer · answered Jul 19 '11 at 13:15


wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://somewebsite.com
sed -n 's@.* URL: *\([^ ]\{1,\}\) .*@\1@p' wgetlog.txt | sed 's@&@\&amp;@g' > sedlog.txt

This creates a file called sedlog.txt that contains all links found on the specified website. You can use PHP or a shell script to convert the text file sitemap into an XML sitemap. Tweak the parameters of the wget command (accept/reject/include/exclude) to get only the links you need.
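For the conversion step, here is a minimal shell sketch. It assumes sedlog.txt holds one URL per line with ampersands already escaped, as the sed command above produces. The function name `urls_to_sitemap` and the output file name `sitemap.xml` are made up for illustration, and only the required `<loc>` element is written.

```shell
#!/bin/sh
# Wrap a list of URLs (one per line on stdin) in sitemap XML.
# Assumption: the URLs are already XML-escaped, as in sedlog.txt.
urls_to_sitemap() {
    printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
    printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    # sort -u drops duplicate URLs; empty lines are skipped.
    sort -u | while IFS= read -r url; do
        if [ -n "$url" ]; then
            printf '  <url><loc>%s</loc></url>\n' "$url"
        fi
    done
    printf '%s\n' '</urlset>'
}

# Usage (sitemap.xml is a hypothetical output name):
#   urls_to_sitemap < sedlog.txt > sitemap.xml
```

Optional tags such as `<lastmod>` are left out because the wget log does not give a reliable value for them.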


Salman A


+1 Couldn't quite use it like that as it was giving me a bunch of errors (probably because of different wget/sed versions). But once I did some tweaking, it worked like a charm. Thanks! – Julian Aug 12 '11 at 17:22
You should add a small delay between requests using `--wait=1`, otherwise it might affect the performance of the site. – Liam Sep 18 '14 at 11:08
Combined with `tee` (https://unix.stackexchange.com/a/128476/312058), you can also watch the output on stdout; `tail -f` on the log file is even better. – Phani Rithvij Apr 17 '21 at 14:43
@Julian Yes, I had the same issue. On macOS, I had to use `gsed` instead of the builtin `sed`. Thanks for the tip! – GDP2 Apr 21 '21 at 23:14

score 2 · Answer 2 · answered Oct 16 '10 at 12:58


You can use this Perl script to do the trick: http://code.google.com/p/perlsitemapgenerator/


Gilles Quénot


It generates the sitemap by scanning the file system, but it won't "crawl". The sites I want to spider are dynamic. – Salman A Oct 16 '10 at 13:26
