
I have found a website that has a lot of high-quality free images hosted on Tumblr (it says you can do whatever you want with these images :P).

I am running Ubuntu 12.04 LTS. I need to write a script that will run periodically (say, daily) and download only the images that were not downloaded earlier.

Additional note: the site has a JavaScript auto-scroller, and more images are loaded when you reach the bottom of the page.

– Gokul N K

3 Answers


First, you have to find out how the autoscrolling script works. The easiest way to do this is not to reverse-engineer the JavaScript, but to look at the network activity, for example in the "Net" panel of the Firebug Firefox plugin. You quickly see that the website is organized in pages:

unsplash.com/page/1
unsplash.com/page/2
unsplash.com/page/3
...

When you scroll down, the script requests and downloads the succeeding pages.
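
You can quickly confirm this pattern from the command line as well (just a sanity check, a sketch assuming the markup still looks like the snippet further below):

# fetch one listing page and count the lines containing image tags
wget -qO- unsplash.com/page/2 | grep -c '<img src'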

So we can actually write a script that downloads all the pages, parses their HTML for all the images, and downloads them. If you look at the HTML code, you see that each image appears there in a nice and unique form:

<a href="http://bit.ly/14nUvzx"><img src="http://31.media.tumblr.com/2ba914db5ce556ee7371e354b438133d/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download &nbsp;/ &nbsp;By Tony&nbsp;Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>

The <a href="..."> attribute contains the URL of the full-resolution image. The title attribute contains a nice unique URL that also leads to the image; we will use it to construct a nice unique name for the image, much nicer than the one under which it is stored. This unique name also ensures that no image is downloaded twice.
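
For example, the title URL from the HTML snippet above turns into a file name like this (the same transformation the script below applies to every image):

echo "http://unsplash.com/post/55904517579/download-by-tony-naccarato" \
        | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g'
# prints: 55904517579_download-by-tony-naccarato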

Shell script (unsplash.sh)

#!/bin/bash
mkdir -p imgs
I=1
while true ; do # for all the pages
        wget "unsplash.com/page/$I" -O tmppage
        grep '<a href.*<img src.*title' tmppage > tmppage.imgs
        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
                break
        fi
        echo "Reading page $I:"
        sed 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/' tmppage.imgs | while read IMG POST ; do
                # for all the images on the page
                TARGET=imgs/$(echo "$POST" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g').jpg
                echo -n "Photo $TARGET: "
                if [ -f "$TARGET" ] ; then # we already have this image
                        echo "already have"
                        continue
                fi
                echo "downloading"
                wget "$IMG" -O "$TARGET"
        done
        I=$((I+1))
done
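
For a first manual run (assuming the script was saved as unsplash.sh in the current directory):

chmod +x unsplash.sh
./unsplash.sh     # the first run downloads everything into ./imgs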

To make sure this runs every day:

create a wrapper script unsplash.cron:

#!/bin/bash

export PATH=... # might not be needed, but sometimes the PATH is not set
                # correctly in cron-called scripts. Copy the PATH setting you
                # normally see in a console.

cd YOUR_DIRECTORY # the directory where the script and imgs directory is located

{
echo "========================"
echo -n "run unsplash.sh from cron "
date

./unsplash.sh 

} >> OUT.log 2>> ERR.log

Then add this line to your crontab (after running crontab -e in the console):

10 3 * * * PATH_to_the/unsplash.cron

This will run the script every day at 3:10 AM.
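
To check that the entry was installed and to follow the logs written by the wrapper, something like this should do (OUT.log and ERR.log are the files used above):

crontab -l | grep unsplash      # the new entry should be listed
tail -f OUT.log ERR.log         # watch the output of the daily runs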

– Tomas
  • that is some neat code :) On subsequent cron runs it would still iterate through all the images even though we already have them. I was just wondering whether there is a way to detect this and exit early. But if the script stopped somewhere in the middle, we would never download those images. I am not sure there is a proper way out of this. – Gokul N K Dec 29 '13 at 09:06
  • @GokulNK, this code will never download an already-downloaded image. It only downloads all the HTML *pages* at each run, because we cannot be sure whether newly added images are on the first or the last page, nor that something hasn't been added in the middle. Once you figure out that pattern (which you never will 100%, they may change it!), you might "optimize" the code; but since we are only talking about downloading some 30 HTML pages, is it worth the effort and the loss of robustness? – Tomas Dec 29 '13 at 09:35
  • I do agree with you. Just wanted to check if there is a simple way out ;) Thanks for the elegant script. – Gokul N K Dec 30 '13 at 05:04

Here's a small Python version of the download part. The getImageURLs function fetches the data from http://unsplash.com/page/X, looks for lines that contain the word 'Download', and extracts the image 'src' attribute from them. It also looks for the strings current_page and total_pages (which are present in the JavaScript code) to find out how long to keep going.

Currently, it first retrieves all the URLs from all the pages, and for each of these URLs the image is downloaded if the corresponding file does not exist locally. Depending on how the page numbering changes over time, it may be somewhat more efficient to stop looking for image URLs as soon as a local copy of a file has been found. The files are stored in the directory from which the script was executed.

The other answer explains very well how to make sure something like this can get executed daily.

#!/usr/bin/env python

import urllib
import os

def getImageURLs(pageIndex):
    # Fetch one listing page and scan it line by line for image 'src' URLs
    # and for the 'current_page'/'total_pages' counters in the JavaScript.
    f = urllib.urlopen('http://unsplash.com/page/' + str(pageIndex))
    data = f.read()
    f.close()

    curPage = None
    numPages = None
    imgUrls = [ ]

    for l in data.splitlines():
        if 'Download' in l and 'src=' in l:
            idx = l.find('src="')
            if idx >= 0:
                idx2 = l.find('"', idx+5)
                if idx2 >= 0:
                    imgUrls.append(l[idx+5:idx2])

        elif 'current_page = ' in l:
            idx = l.find('=')
            idx2 = l.find(';', idx)
            curPage = int(l[idx+1:idx2].strip())
        elif 'total_pages = ' in l:
            idx = l.find('=')
            idx2 = l.find(';', idx)
            numPages = int(l[idx+1:idx2].strip())

    return (curPage, numPages, imgUrls)

def retrieveAndSaveFile(fileName, url):
    # Download the file at 'url' and write it to 'fileName' in binary mode.
    f = urllib.urlopen(url)
    data = f.read()
    f.close()

    g = open(fileName, "wb")
    g.write(data)
    g.close()

if __name__ == "__main__":

    allImages = [ ]
    done = False
    page = 1
    while not done:
        print "Retrieving URLs on page", page
        res = getImageURLs(page)
        allImages += res[2]

        if res[0] >= res[1]:
            done = True
        else:
            page += 1

    for url in allImages:
        idx = url.rfind('/')
        fileName = url[idx+1:]
        if not os.path.exists(fileName):
            print "File", fileName, "not found locally, downloading from", url
            retrieveAndSaveFile(fileName, url)

    print "Done."
– brm

The fantastic original script by TMS no longer works with the new Unsplash website. Here is an updated, working version.

#!/bin/bash
mkdir -p imgs
I=1
while true ; do # for all the pages
        wget "https://unsplash.com/grid?page=$I" -O tmppage

        grep 'img.*src.*unsplash.imgix.net' tmppage | cut -d'?' -f1 | cut -d'"' -f2 > tmppage.imgs

        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
                break
        fi

        echo "Reading page $I:"
        while read IMG; do

                # for all the images on the page
                TARGET=imgs/$(basename "$IMG")

                echo -n "Photo $TARGET: "
                if [ -f "$TARGET" ] ; then # we already have this image
                        echo "file already exists"
                        continue
                fi
                echo -n "downloading (PAGE $I)"

                wget "$IMG" -O "$TARGET"
        done < tmppage.imgs
        I=$((I+1))
done
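
Since the overall behaviour is the same, this updated script can be plugged into the daily cron wrapper from the first answer; for example, if it is saved as unsplash-new.sh (a hypothetical name), call it there instead of the original:

# in unsplash.cron, replace the ./unsplash.sh call with:
./unsplash-new.sh    # hypothetical name for the script above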
– joseLuís