
I've been trying to figure this one out for about a week now and just can't come up with a good solution. So, I figured I would see if anyone could help me out. Here's one of the links that I'm trying to scrape:

http://content.lib.washington.edu/cdm4/item_viewer.php?CISOROOT=/alaskawcanada&CISOPTR=491&CISOBOX=1&REC=4

I right-clicked to copy image location. This is the link that is copied:

http://content.lib.washington.edu/cgi-bin/getimage.exe?CISOROOT=/alaskawcanada&CISOPTR=491&DMSCALE=100.00000&DMWIDTH=802&DMHEIGHT=657.890625&DMX=0&DMY=0&DMTEXT=%20NA3050%20%09AWC0644%20AWC0388%20AWC0074%20AWC0575&REC=4&DMTHUMB=0&DMROTATE=0

There is no clear image URL being displayed; the image is apparently served through some kind of script. Through trial and error I found that I can put ".jpg" after "CISOPTR=491" and the link then becomes an image URL. The problem is that this is not the high-resolution version of the image. To get to the high-resolution version I have to change the URL even more. I found several articles on Stack Overflow that mention building a script using cURL and PHP, and I have even tried a few of them with no luck. "491" is the image number, and I can change that number to find other images in the same directory, so scraping a sequence of numbers should be pretty easy. But I'm still a noob at scraping and this one is kicking my butt. Here's what I've tried:

Get remote image using cURL then resample

I also tried this:

http://psung.blogspot.com/2008/06/using-wget-or-curl-to-download-web.html

I also have OutWit Hub and SiteSucker, but they don't recognize the URL as an image file, so they just pass right over it. I ran SiteSucker overnight and it downloaded 40,000 files, of which only 60 were JPEGs, none of which were the ones I wanted.

The other thing I keep running into: for the files I have been able to download manually, the filename is always either getfile.exe or showfile.exe, and if I manually add ".jpg" as the extension I can view the image locally.

How can I reach the original high-res image file, and automate the download process so that I can scrape a couple hundred of these images?

1 Answer


I right-clicked to copy image location. This is the link that is copied:

Notice that the URL has ".exe" in it. Now look at the stuff in the query string:

DMSCALE=100.00000
DMWIDTH=802
DMHEIGHT=657.890625
DMX=0
DMY=0
DMTEXT=%20NA3050%20%09AWC0644%20AWC0388%20AWC0074%20AWC0575
REC=4
DMTHUMB=0
DMROTATE=0

This strongly implies that the original source of this image is in a database or something, and that it is being passed through a server-side filter (not sure if that is what you meant by "some type of script"). I.e., this is dynamically generated content, not static, and the same caveats apply as would to dynamic text content: you have to figure out what instructions to give the server to get it to cough up what you want. Which you pretty much have in front of you... if SiteSucker or whatever won't deal with it properly, scrape the addresses yourself using an HTML parser.
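If an off-the-shelf crawler won't follow these links, a short script can iterate the image numbers and request each one directly. A minimal sketch, assuming the CISOPTR numbers are roughly sequential and that the DMSCALE/DMWIDTH/DMHEIGHT values below actually request the full-resolution version (both assumptions are based on the copied link above, not on documented behavior):

```python
# Sketch: fetch a run of images by iterating CISOPTR and saving each
# response with a .jpg extension (the server names everything
# getimage.exe, so the extension has to be added locally).
import urllib.request

URL_TEMPLATE = (
    "http://content.lib.washington.edu/cgi-bin/getimage.exe"
    "?CISOROOT=/alaskawcanada&CISOPTR={n}"
    # DMWIDTH/DMHEIGHT here are guesses at "big enough for full res":
    "&DMSCALE=100.00000&DMWIDTH=4000&DMHEIGHT=4000"
    "&DMX=0&DMY=0&DMTHUMB=0&DMROTATE=0"
)

def plan(first, last):
    """Return (url, local filename) pairs for a range of image numbers."""
    return [(URL_TEMPLATE.format(n=n), "awc_{:04d}.jpg".format(n))
            for n in range(first, last + 1)]

if __name__ == "__main__":
    for url, name in plan(491, 493):  # widen the range as needed
        urllib.request.urlretrieve(url, name)
        print("saved", name)
```

The same loop works with wget or cURL in a shell script; the only real trick is constructing the query string yourself instead of hoping a crawler discovers it.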

CodeClown42
  • Thanks for the reply, and for confirming some of my assumptions. I'm extremely inexperienced with server-side functionality and lingo, so please forgive any incorrect verbiage. I have actually been playing around with the different parameters in the query string, and I've been able to get the image to its max resolution as a JPG. I've even been copying and pasting the parameters that I wound up at, and they produce the high-res JPG every time. It would just be nice to be able to automate the process. For instance, it's just a pain to have to – user1376196 May 07 '12 at 03:18