
I have a directory with > 1000 .html files and would like to check all of them for broken links, preferably from the console. Can you recommend a tool for such a task?

Hubert Kario

4 Answers


You can use wget, e.g.:

wget -r --spider  -o output.log http://somedomain.com

At the bottom of output.log, wget will indicate whether it found any broken links. You can parse that using awk/grep.
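
For example, something along these lines pulls the relevant parts back out of the log. The exact wording and layout of wget's messages vary a little between versions, so treat the patterns as a starting point and adjust them to what your output.log actually contains:

# print wget's broken-link summary, which appears at the bottom of the log
sed -n '/Found .* broken link/,$p' output.log

# or list just the URLs whose HTTP request came back with a 4xx/5xx status
grep -B 4 'awaiting response... [45]' output.log \
    | grep -Eo 'https?://[^ ]+' \
    | sort -u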

ghostdog74
  • An alternative **wget** command line to check for broken links can be found in [this answer](http://stackoverflow.com/a/15029100/1497596). Also note that a comment that I left on that answer provides a link to **wget for Windows**. – DavidRR Sep 16 '14 at 20:39

I'd use checklink (a W3C project)
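
If checklink is installed (it is the command-line script of the W3C link checker, available from CPAN as W3C-LinkChecker), a run could look roughly like this. The option names below are taken from its documentation, so confirm them against checklink --help for your version:

# recurse through the site, stay quiet unless something is wrong,
# and report only broken links rather than every redirect
checklink --quiet --summary --broken --recursive http://somedomain.com/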

Quentin
  • As long as you are careful to set the user agent and accept headers (to avoid bogus error codes from bot detectors) this should work. – Tim Post Mar 15 '10 at 11:41
  • It looks OK, but it's definitely not intended for such large projects - it doesn't have any way to just list broken links, and the output for my project is *really* big. –  Mar 15 '10 at 13:25

You can extract links from HTML files using the Lynx text browser. Bash scripting around this should not be difficult.
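
A rough sketch of that approach, using curl (not mentioned above, but a common choice) to probe each link; file names, patterns, and the curl options are illustrative rather than a finished tool:

#!/bin/sh
# For every .html file: have lynx dump its list of links, then probe each
# URL with a HEAD request and report the ones that do not return success.
for f in *.html; do
    lynx -dump -listonly "$f" \
        | grep -Eo 'https?://[^ ]+' \
        | sort -u \
        | while read -r url; do
            # --head sends a HEAD request; some servers reject HEAD,
            # so drop it to fall back to a normal GET
            if ! curl -s -o /dev/null --head --fail "$url"; then
                echo "BROKEN: $url (in $f)"
            fi
        done
done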

mouviciel

Try the webgrep command line tools or, if you're comfortable with Perl, the HTML::TagReader module by the same author.

gareth_bowles