6

In order to automate things, I need to recursively download a web page. I'm using wget, as it's probably the most programmer-friendly tool available, with the -r flag to enable link following.

wget, however, doesn't handle pretty URLs, i.e. http://webpage/index.php/my/pretty/link, and treats them as subdirectories.
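For example, a plain recursive fetch like the one below ends up mirroring each path segment as a nested directory (the paths here are just an illustration):

wget -r http://webpage/
# ends up as something like webpage/index.php/my/pretty/link on disk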

Is there a solution to this problem? (I'd rather not modify that web page's source code)

Cheers, MH

EDIT: Problem solved

Thanks for your insightful replies!

I've managed to solve the problem, though it did take a minor modification to the web page in question.

What I did was simple: I used my server's URL rewriting features to redirect URLs from http://webpage/my/pretty/link to http://webpage/index.php/my/pretty/link. Then I ran wget with the following flags:

wget --mirror --page-requisites --html-extension --convert-links http://webpage/

Voila! It all works flawlessly (directories still get created in the process, but from this point it's trivial to handle with some sort of script).
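For reference, the rewrite boils down to something along these lines, assuming an Apache-style mod_rewrite setup in a .htaccess file (the exact syntax depends on your server):

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L]

With a rule like this, any request that doesn't match a real file or directory is handed to index.php behind the scenes, so the page's links can drop the index.php prefix.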

Mike Hordecki

4 Answers

3

Well, how is wget supposed to know whether index.php/my/pretty is actually a directory or not? It isn't obvious at all from an HTTP client's perspective.

Maybe you can use wget --exclude-directories to work around this? Or check out wget -nd, which creates a flat set of files rather than a directory tree.
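The invocations would look something like this (just a sketch, not tested against your site):

wget -r -nd http://webpage/
wget -r -X /index.php http://webpage/

The first dumps every fetched file into a single flat directory; the second tells the recursive crawl to skip anything under /index.php/ altogether.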

kubanczyk
1

Pretty URLs typically aren't self-contained; more often they rely on some mechanism for passing data back and forth (typically via POST or cookies) to an MVC-framework-based application on the backend.

If you're making multiple wget calls, it's worth noting that wget uses cookies but does not save them by default, meaning each wget run starts with fresh cookies and won't have the state information available. The --save-cookies <filename> and --load-cookies <filename> options will help you there.
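A sketch of that pattern; the login URL and form fields below are made up purely to show the flags:

wget --save-cookies cookies.txt --keep-session-cookies --post-data='user=me&pass=secret' http://webpage/index.php/login
wget --load-cookies cookies.txt -r http://webpage/

--keep-session-cookies matters here because session cookies are normally discarded rather than written to the cookie file.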

If the web application relies on POST, I'd guess you will likely have to write a crawler specifically tailored to that site.
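If it's only a handful of known endpoints, wget's --post-data can sometimes stand in for a full crawler; the endpoint and field names below are hypothetical:

wget --post-data='query=foo&page=1' -O results.html http://webpage/index.php/search

Anything beyond that, and a dedicated script really is the way to go.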

Zenham
1

Maybe you can use Firefox with the iMacros add-on instead of wget? It has command-line support, but cannot follow links automatically (you would need to script that).

http://wiki.imacros.net/iMacros_for_Firefox#Command_Line_Support

I use it to download various reports daily.
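If memory serves, the invocation looks roughly like this (the macro name is a placeholder; check the wiki page above for the exact syntax):

firefox "imacros://run/?m=download_report.iim"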

-1

If you're fetching the same site many times, you could consider the alias command: make an alias with a friendly name that points at wget with the full URL.

alias mywget='wget "http://domain.com/file/?search&channel=24"'

Obviously add any switches you need; then your peeps can just run mywget to do the job.

Note that the URL is quoted inside the alias; without the quotes the shell would treat the & as a background operator when the alias expands.

Hope that helps.

Rodent43