
I'm writing a script that will pull data from a basic HTML page based on the following:

The first parameter in the URL floats between -90.0 and 90.0 (inclusive) and the second set of numbers are between -180.0 and 180.0 (inclusive). The URL will direct you to one page with a single number as the body of the page (for example, http://jawbone-virality.herokuapp.com/scanner/desert/-89.7/131.56/). I need to find the largest virality number between all of the pages attached to the URL.

So, right now I have it printing the first and second number, as well as the number in the body (we call it virality). It's only printing to the console; every time I try writing it to a file, it errors out on me. Any hints or anything I'm missing? I'm very new to Python, so I'm not sure if I'm overlooking something obvious.

import shutil
import os
import time
import datetime
import math
import urllib
from array import array
myFile = open('test.html','w')
m = 5
for x in range(-900,900,1):
    for y in range(-1800,1800,1):
        filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/'+str(x/10)+'/'+str(y/10)+'/')
        print 'Planet Desert: (' + str(x/10) +','+ str(y/10) + '), Virality: ' + filehandle.readlines()[0] #lines
        #myFile.write('Planet Desert: (' + str(x/10) +','+ str(y/10) + '), Virality: ' + filehandle.readlines()[0])
myFile.close()
filehandle.close()

Thank you!

Cassidy
  • What errors are you getting? – Maciej Gol Sep 08 '13 at 08:22
  • Slightly off topic, but this code will make around 6.5 million HTTP requests... is this what you really want!? If so you may be better off using a multi-threaded approach with Queue - http://docs.python.org/2/library/queue.html#module-Queue – will-hart Sep 08 '13 at 08:30
  • One could also use [Scrapy](http://scrapy.org/) for this problem (it's already threaded in my experience). – aufziehvogel Sep 08 '13 at 09:24
  • Why are you using `myFile.readlines()[0]` when `myFile.readline()` (singular) would do? – Martijn Pieters Sep 08 '13 at 09:43
  • @Aufziehvogel that's exactly something I was going for. Is there something easier to install?? – Cassidy Sep 08 '13 at 12:32
  • @will-hart honestly, I just want the task done, I don't have much time to do it, and I'm new to Python, so it doesn't matter what happens... – Cassidy Sep 08 '13 at 12:33
  • @MartijnPieters Because I didn't know that was a thing. :) – Cassidy Sep 08 '13 at 14:03
  • @CassidyWilliams fair enough :) the main issue is going to be speed - making 6.5 million requests on a single thread will take a lot longer than doing it on 20 at once. (Plus it looks like you have to do this task for six different URLs so that's nearly 40 million requests!) Anyways, good luck :) – will-hart Sep 08 '13 at 14:09
  • @CassidyWilliams Using windows? Just saw the Installation guide of scrapy for windows :D You could write the threads your own, it's not that difficult with Pythons `threading` module. It features both threading of functions (which might be enough for you) and classes ([tutorial](http://www.tutorialspoint.com/python/python_multithreading.htm)). – aufziehvogel Sep 08 '13 at 14:58
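The thread/Queue approach the commenters suggest can be sketched roughly like this. Note this is an illustration, not code from the thread: `fetch_virality` is a hypothetical stand-in for the real `urllib` call (so the sketch runs without hitting the Heroku URL), and the worker count is arbitrary. The `queue` module is spelled `Queue` in Python 2, as in will-hart's link.

```python
import threading
from queue import Queue, Empty  # the module is named Queue in Python 2

def fetch_virality(x, y):
    # Hypothetical stand-in for the real HTTP request to
    # http://jawbone-virality.herokuapp.com/... -- returns a fake number.
    return abs(x) + abs(y)

def max_virality(coords, n_workers=4):
    # Fill a queue with all coordinate pairs to scan.
    q = Queue()
    for c in coords:
        q.put(c)

    results = []
    lock = threading.Lock()

    def worker():
        # Each worker pulls coordinates until the queue is drained.
        while True:
            try:
                x, y = q.get_nowait()
            except Empty:
                return
            v = fetch_virality(x, y)
            with lock:
                results.append(v)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max(results)
```

With several workers pulling from one shared queue, the slow network waits overlap instead of running one after another.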

1 Answer


When writing to the file, do you still have the print statement before it? Then your problem is that Python advances the file pointer to the end of the file when you call readlines(). A second call to readlines() therefore returns an empty list, and accessing its first element raises an IndexError.

See this example execution:

filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
print(filehandle.readlines())  # prints ['5']
print(filehandle.readlines())  # prints []

The solution is to save the result into a variable and then use it.

filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
res = filehandle.readlines()[0]
print(res)  # prints 5
print(res)  # prints 5
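Since the question's actual goal was writing to a file, the saved value can feed both the print and the write. This is a minimal sketch (my helper function, not code from the question) that reuses the body exactly once, formatted like the question's output line:

```python
def record(out, x, y, virality):
    # Build the line once from the already-saved body, then both
    # return it (for printing) and write it to the open file object.
    line = 'Planet Desert: (' + str(x) + ',' + str(y) + '), Virality: ' + virality
    out.write(line + '\n')
    return line
```

Passing the open file object in means the same helper works with `myFile = open('test.html', 'w')` from the question.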

Yet, as already pointed out in the comments, calling readlines() here is not needed, because the body of the page seems to be just a single integer. The concept of lines does not really add any information there, so let's drop it in favour of the simpler function read() (we don't even need readline() here).

filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
res = filehandle.read()
print(res)  # prints 5
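Since the stated goal is the largest virality number across all pages, the body returned by read() can be stripped and converted before comparing. A small sketch (my helper, under the assumption that each body is one integer, possibly with a trailing newline):

```python
def update_max(current_max, body):
    # body is the raw page text, e.g. '5' or '5\n';
    # strip whitespace and compare as an integer.
    return max(current_max, int(body.strip()))
```

Calling this once per page inside the loop keeps only the running maximum instead of storing 6.5 million values.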

There's still another problem in your source code. From your use of urllib.urlopen() I can tell you are using Python 2. However, in Python 2 division of two integers behaves as in C or Java: the result is rounded down to an integer. So str(x/10) is '-90' for every x from -900 to -891, and likewise for y, which means you will request http://jawbone-virality.herokuapp.com/scanner/desert/-90/-180/ one hundred times (ten x values times ten y values).

This can be fixed by either dividing by a float, e.g. `str(x/10.0)`, or by putting `from __future__ import division` at the top of the file so that `/` performs true division.
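A sketch of the float-division fix applied to the question's URL building (the wrapper function is mine, not from the original; it assumes the same divide-by-ten scaling the question uses):

```python
BASE = 'http://jawbone-virality.herokuapp.com/scanner/desert/'

def scan_url(x, y):
    # Dividing by 10.0 forces float division even in Python 2,
    # so -897 becomes -89.7 instead of being floored to -90.
    return BASE + str(x / 10.0) + '/' + str(y / 10.0) + '/'
```

Note that whole numbers will render as e.g. '-90.0' rather than '-90' this way; whether the server accepts both forms is something the question doesn't tell us.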

Hope this helps!

aufziehvogel