
I have a problem with wget: I need to download an entire site, including the images and other files linked from the main pages. I'm using these options:

wget --load-cookies /tmp/cookie.txt -r -l 1 -k -p -nc 'https://www.example.com/mainpage.do'

(-l 1 is just for testing; I may need to go down to level 3 or even 4.)

The problem is that I can't figure out how to bypass the 'random' GET parameter that is added after some recursion cycles, so my final result in the /tmp folder looks like this:

/tmp/www.example.com/mainpage.do
/tmp/www.example.com/mainpage.do?cx=0.0340590343408
/tmp/www.example.com/mainpage.do?cx=0.0348934786475
/tmp/www.example.com/mainpage.do?cx=0.0032878284787
/tmp/www.example.com/mainpage.do?cx=0.0266389459023
/tmp/www.example.com/mainpage.do?cx=0.0103290334732
/tmp/www.example.com/mainpage.do?cx=0.0890345378478

Since the page is always the same, I don't need to fetch it more than once. I tried the -nc option, but it doesn't work; I also tried -R (reject), but it only matches file extensions, not URL parameters.

I looked extensively through the wget manual but couldn't find a way to do it. Using wget is not mandatory; if you know how to do this some other way, suggestions are welcome.

Lex
  • What are your purposes? The way I see it, you've run into the problem of leeching from a site with anti-leeching code. Perhaps you might also tell us how this qualifies as a system administration question. – John Gardeniers Oct 23 '09 at 09:31
  • Actually, the site I'm trying to download is my own employer's, and yes, it has security constraints. I'm allowed to do this; we need a static working copy of the site. – Lex Oct 23 '09 at 12:33
  • I don't know if it really qualifies as a sysadmin question; I thought it didn't belong on Stack Overflow, so I asked it here. Sorry if I was wrong; just move it to Stack Overflow. – Lex Oct 23 '09 at 12:34
  • It may become a Stack Overflow question if you take a programmatic route. Otherwise it's probably a Super User question. – Kyle Smith Oct 23 '09 at 14:09

1 Answer


Write a local proxy server that modifies the responses sent to wget.

Assuming your URLs are in links such as:

<a href="/path/to/mainpage.do?cx=0.0123412341234">

Then you can run a Ruby proxy server like this:

require 'webrick/httpproxy'

# Proxy on port 2200 that strips the random cx= parameter from links
# before the response reaches wget, so every variant of the URL
# collapses back to plain mainpage.do.
s = WEBrick::HTTPProxyServer.new(
  :Port => 2200,
  :ProxyContentHandler => Proc.new { |req, res|
    res.body.gsub!(/mainpage\.do\?cx=[0-9.]*/, "mainpage.do") if res.body
  }
)
trap("INT") { s.shutdown }
s.start
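
Then point wget at the proxy. A minimal sketch of the invocation, assuming the proxy is running on localhost port 2200 as above (note that a rewriting proxy like this only works for plain http; it cannot alter content inside an https tunnel):

wget -e use_proxy=yes -e http_proxy=http://127.0.0.1:2200/ --load-cookies /tmp/cookie.txt -r -l 1 -k -p -nc 'http://www.example.com/mainpage.do'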
Kyle Smith
  • I thought about the same thing during the course of the day; I'll implement it with Python, though. I saw this: http://stackoverflow.com/questions/989739/how-to-write-a-proxy-server-in-python – Lex Oct 23 '09 at 16:22