
I found this piece of code, which lets me download a single file from an online zip archive. It works remarkably well, but I don't understand how it works, especially how the class is used here (I only have some basic knowledge of classes). I simplified the original code a bit to get the MWE below.

import zipfile
import urllib2

DEBUG = True
def HTTPGetFileSize(url):
  request = urllib2.Request(url)
  page = urllib2.urlopen(request)
  size = page.headers['content-length']
  page.close()
  return int(size)

def HTTPGetPartialData(url, f, t):
  request = urllib2.Request(url)
  request.headers['range'] = 'bytes=%u-%u' % (f, t)
  partial_page = urllib2.urlopen(request)
  partial_data = partial_page.read()
  partial_page.close()
  return partial_data

class MyFileWrapper:
  def __init__(self, url):
    self.url = url
    self.position = 0
    self.total_size = HTTPGetFileSize(url)

  def seek(self, offset, whence):
    if whence == 0:
      self.position = offset
    elif whence == 1:
      self.position += offset
    elif whence == 2:
      self.position = self.total_size + offset

    if DEBUG==True:
      print "seek: (%u) %u -> %u" % (whence, offset, self.position)
    pass

  def tell(self):
    if DEBUG==True:    
      print "tell: -> %u" % self.position
    return self.position

  def read(self, amount=-1):
    if amount == -1:
      amount = self.total_size - self.position
    d = HTTPGetPartialData(self.url, self.position, self.position + amount - 1)
    self.position += len(d)
    if DEBUG==True:
      print "read: %u %u -> %u" % (self.position - len(d), amount, self.position)
    return d

url = 'http://the.url.that/contains/the/zipfiles.zip'
filename = 'the_name_of_the_file_I_need.csv'
f = MyFileWrapper(url)
print "class like object f is constructed"
z = zipfile.ZipFile(f)
print "f is read by zipfile and passed to z"
content = z.open(filename)
print "open filename, pass to content"
print content.read()

I have a lot of questions, but I am mainly confused by:

  1. How does my input filename ever get into all the functions?
  2. What is the flow/order of the functions in this piece of code? It seems that after running the tell function, the code goes back to the seek function again.
  3. How are offset and whence initialized and updated?

Any help is appreciated.

EDIT: I have included the version of the code with debug output enabled; below is the output of a sample test:

class like object f is constructed
seek: (2) 0 -> 34632410
tell: -> 34632410
seek: (2) -22 -> 34632388
read: 34632388 22 -> 34632410
seek: (2) -42 -> 34632368
read: 34632368 20 -> 34632388
seek: (0) 34622294 -> 34622294
read: 34622294 10094 -> 34632388
f is read by zipfile and passed to z
seek: (0) 34621363 -> 34621363
read: 34621363 30 -> 34621393
read: 34621393 41 -> 34621434
open filename, pass to content
read: 34621434 860 -> 34622294
....content of the filename.....
Zhen Sun
  • I found an almost identical implementation here: http://stackoverflow.com/questions/7829311/is-there-a-library-for-retrieving-a-file-from-a-remote-zip?answertab=votes#tab-top, so it seems this is not dependent on `ZipFile` but is a standalone script for downloading on demand. – Zhen Sun Jan 26 '15 at 05:03
  • Right. MyFileWrapper lets you use any file accessible over HTTP as if it were a local file (if the server supports the Range header). This script uses it to read a zip file. – Jasper Jan 26 '15 at 07:29

2 Answers


This is a prime example of Python's "duck typing" approach. To explain this, let's first consider what a file actually is, from a programmer's point of view:

  • A file is something that can be read and delivers bytes to the program.
  • A file has a length (that may change during a file's lifetime, but at any point in time, it has a length).
  • Some files can be written to, but some can not be written to: a file on a CD/DVD/Bluray is a true "read only" file, but still a file.

The code in your example provides a class that implements a read method, which in the end returns bytes, so it can be treated like a normal file object in Python (if it has a read method, it's a file!)

Consider this minimal example:

class SimpleFile(object):
    def read(self):
        return b"a,b,c,d"

class SimpleFileUser(object):
    def __init__(self, f):
        self.f = f

    def use_file(self):
        print(self.f.read())

sf = SimpleFile()
sfu_1 = SimpleFileUser(sf)
sfu_1.use_file()

real_file = open('test.txt')
sfu_2 = SimpleFileUser(real_file)
sfu_2.use_file()

The class SimpleFileUser can use anything as a file that implements a read method. That could be a file object returned by open, or an instance of the SimpleFile class, because this class also provides a read method.

The class MyFileWrapper implements functions that allow you to access a file that is reachable via HTTP. Therefore, it provides functions to tell the current position in the file, seek (jump) to a different position in the file, and to read actual data from the file. The file in this case is the thing accessible via HTTP. How the methods are called is up to the ZipFile. If you use the DEBUG variable in the original code, you can see what ZipFile is actually doing to read the data.
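To watch ZipFile drive a file-like object without needing a remote server, the same idea can be reproduced in memory. The following is only a sketch in Python 3 syntax (the question's code is Python 2); `LoggingFile` and `wanted.csv` are hypothetical names invented for this illustration, and the exact sequence of calls may differ between Python versions.

```python
import io
import zipfile

# Build a small zip archive in memory so the example needs no network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("wanted.csv", "a,b,c,d\n")
archive_bytes = buffer.getvalue()


class LoggingFile(io.BytesIO):
    """Hypothetical helper: a BytesIO that records each seek/tell/read call."""

    def __init__(self, data):
        super().__init__(data)
        self.calls = []

    def seek(self, offset, whence=io.SEEK_SET):
        self.calls.append("seek")
        return super().seek(offset, whence)

    def tell(self):
        self.calls.append("tell")
        return super().tell()

    def read(self, size=-1):
        self.calls.append("read")
        return super().read(size)


f = LoggingFile(archive_bytes)
z = zipfile.ZipFile(f)                  # ZipFile calls methods on f to find the index
calls_during_init = list(f.calls)       # snapshot of the calls made so far
content = z.open("wanted.csv").read()   # more calls on f to fetch the member's data

print(calls_during_init)
print(f.calls[len(calls_during_init):])
print(content)
```

Nothing in this sketch calls `seek`, `tell`, or `read` on `f` directly; every recorded call originates inside the zipfile module, which is the same situation as in the question's debug output.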

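offset and whence are just parameters to the seek function, which is modeled after the C function fseek: whence selects the reference point (0 = start of the file, 1 = current position, 2 = end of the file) and offset is added to that reference point. The current position in the file/HTTP-accessible thing is stored in the member variable self.position.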
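The three whence values handled by the question's seek method (0, 1 and 2) behave the same way on an ordinary file object. A minimal sketch, using an in-memory `io.BytesIO` as a stand-in for any seekable file:

```python
import io

f = io.BytesIO(b"0123456789")   # stand-in for a seekable file, 10 bytes long

f.seek(3, io.SEEK_SET)    # whence=0: offset counted from the start of the file
pos_set = f.tell()

f.seek(2, io.SEEK_CUR)    # whence=1: offset counted from the current position
pos_cur = f.tell()

f.seek(-4, io.SEEK_END)   # whence=2: offset counted from the end of the file
pos_end = f.tell()

print(pos_set, pos_cur, pos_end)
```

The caller chooses both arguments on every call, so nothing is "initialized" in the wrapper itself; it only stores the resulting position.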

The seek and tell methods are defined so that the zipfile module can determine the file size: it sets the file pointer to the end of the file (seek(0, 2), that is, offset 0 with whence 2) and then asks for the file pointer position (tell).
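The first lines of the debug output in the question follow this pattern: a seek to the end, a tell, and then a read of the last 22 bytes. In the ZIP format those 22 bytes hold the "end of central directory" record when the archive has no trailing comment, and that record says where the index starts. The sketch below reproduces these steps by hand on an in-memory archive; it assumes a small archive without a comment and without ZIP64 extensions, and `wanted.csv` is a made-up member name.

```python
import io
import struct
import zipfile

# Create a tiny archive in memory to inspect.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("wanted.csv", "a,b,c,d\n")
f = io.BytesIO(buffer.getvalue())

# Step 1: determine the file size with seek-to-end followed by tell.
f.seek(0, io.SEEK_END)
total_size = f.tell()

# Step 2: read the fixed-size end-of-central-directory record at the very end.
EOCD_FORMAT = "<4s4H2LH"                   # little-endian layout of the record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)   # 22 bytes
f.seek(-EOCD_SIZE, io.SEEK_END)
record = struct.unpack(EOCD_FORMAT, f.read(EOCD_SIZE))
signature, _, _, _, entry_count, dir_size, dir_offset, comment_len = record

# Step 3: jump to the central directory (the index) and read it in one go.
f.seek(dir_offset, io.SEEK_SET)
directory = f.read(dir_size)

print(total_size, entry_count, dir_offset, dir_size)
```

The numbers in the question's trace fit the same layout: the read of 10094 bytes starts at 34622294 and ends at 34632388, which is exactly 22 bytes before the reported file size of 34632410.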

Jasper
  • Thanks. But it's still unclear to me. Why is `whence` initially 2 and `offset` 0? – Zhen Sun Jan 26 '15 at 00:19
  • If you see this in the debug output, it means that `seek` is called with offset 0 and whence 2. This is a common way to determine the "file" size: seek(0, END) to jump to the end, and then `tell` to determine the file size. I'll update my answer soon. – Jasper Jan 26 '15 at 07:26
  • Thanks. I am starting to get some idea, but it's still difficult for me to connect the dots. So in that code, both `z` and `content` inherit the properties of the `MyFileWrapper` class? I realized `content.read()` actually calls `MyFileWrapper.read()` instead of `zipfile.ZipFile.read()`. I included the debug output in the question. – Zhen Sun Jan 27 '15 at 01:31

  1. how does my input filename ever get into all the functions?

It doesn't. It's passed into the ZipFile only. The functions in this code do not use the filename.

  2. what is the flow/order of the functions in this piece of code? It seems that after running the tell function, the code goes back to the seek function again.

There's no specific order. I'm unsure of what you are asking.

  3. How are offset and whence initialized and updated?

In the ZipFile, by the ZipFile code.

What this code does is wrap an online file so that only the parts of it that are actually read get downloaded. The rest of the "magic" is standard ZipFile behavior. It's the read() method that is the interesting one.
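The effect of downloading only what is read can be checked without a web server by simulating the remote file. The sketch below is written in Python 3 syntax and is not the question's code: `RangeBackedFile`, `_fetch_range`, `big.bin` and `wanted.csv` are hypothetical names, and `_fetch_range` merely slices an in-memory bytes object where the real wrapper would send an HTTP request with a Range header.

```python
import io
import zipfile

# Build the "remote" archive in memory: one large member and one small one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("big.bin", b"x" * 200_000)
    archive.writestr("wanted.csv", "a,b,c,d\n")
remote_bytes = buffer.getvalue()


class RangeBackedFile(io.RawIOBase):
    """Hypothetical file-like object that fetches only the byte ranges read."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._size = len(data)
        self._pos = 0
        self.bytes_fetched = 0   # how much was "downloaded" so far

    def _fetch_range(self, first, last):
        # Stand-in for an HTTP GET with "Range: bytes=first-last" (inclusive).
        chunk = self._data[first:last + 1]
        self.bytes_fetched += len(chunk)
        return chunk

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        else:
            raise ValueError("unsupported whence: %r" % (whence,))
        return self._pos

    def read(self, amount=-1):
        if amount is None or amount < 0:
            amount = self._size - self._pos   # read everything that is left
        if amount <= 0:
            return b""
        chunk = self._fetch_range(self._pos, self._pos + amount - 1)
        self._pos += len(chunk)
        return chunk


remote = RangeBackedFile(remote_bytes)
with zipfile.ZipFile(remote) as z:
    content = z.open("wanted.csv").read()

print(content)
print("fetched", remote.bytes_fetched, "of", len(remote_bytes), "bytes")
```

Because ZipFile reads only the index at the end of the archive and then the one requested member, the large `big.bin` entry is never fetched in this sketch.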

Lennart Regebro
  • Thanks for the explanation. Regarding your answer to Q1, I don't understand why the code `content = z.open(filename)` calls `seek` and `read` multiple times. Does `z` inherit the properties of `f`, and do these properties somehow also get called when I call `z.open()`? – Zhen Sun Jan 27 '15 at 01:15
  • No, z is the ZipFile object. It is an object that can read from a zip file. f is the file-like object that holds the zip file's data. f is passed in to the ZipFile object z when it is created. There are several seeks and reads because the ZipFile object reads different parts of the zip file. The index, for example, is at the end of the file. So several seek() calls are needed. – Lennart Regebro Jan 27 '15 at 12:04