15

If a function takes the name of a text file as input, I can refactor it to take a file object instead (I call it a "stream"; is there a better word?). The advantages are obvious. A function that takes a stream as an argument is:

  • much easier to write a unit test for, since I don't need to create a temporary file just for the test
  • more flexible, since I can use it in situations where I somehow already have the contents of the file in a variable

Are there any disadvantages to streams? Or should I always refactor a function from a file name argument to a stream argument (assuming, of course, the file is text-only)?
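
To make the testing advantage concrete, here is a minimal sketch. The function `count_words` is hypothetical, not from any library; the point is only that `io.StringIO` can stand in for an open text file, so no temporary file is needed.

```python
import io


def count_words(stream):
    """Count whitespace-separated words in an open text stream."""
    return sum(len(line.split()) for line in stream)


# In a test, an in-memory io.StringIO object replaces a real file on disk.
fake_file = io.StringIO("one two\nthree\n")
result = count_words(fake_file)
```

The same function works unchanged with a real file: `with open(path) as f: count_words(f)`.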

max

2 Answers

7

... Here is how the xml.etree.ElementTree module implements the parse function:

def parse(self, source, parser=None):
    close_source = False
    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
    ...

A filename is a string, so it has no read() method (the check only looks for an attribute of that name, whatever it is), whereas an open file object has one. These four lines let the rest of the code handle both cases the same way. The only complication is that you have to remember whether to close the file object (here named source). If the function opened it, the function must close it; if the caller passed it in, it must be left open.
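
The same pattern can be sketched for a function of your own. This is an illustration, not the stdlib code: `read_first_line` is a hypothetical name, and opening the file as UTF-8 text is an assumption. The `try`/`finally` makes sure the file is closed only when the function itself opened it.

```python
import io


def read_first_line(source):
    """Return the first line of `source`, a filename or an open text stream."""
    close_source = False
    if not hasattr(source, "read"):
        # Not file-like, so treat it as a path and open it ourselves.
        source = open(source, "r", encoding="utf-8")
        close_source = True
    try:
        return source.readline().rstrip("\n")
    finally:
        # Close only what we opened; a caller-supplied stream stays open.
        if close_source:
            source.close()


stream = io.StringIO("header\nbody\n")
first = read_first_line(stream)
```

After the call, `stream` is still open and positioned just past the first line, so the caller can keep reading from it.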

Actually, files differ slightly from streams. Streams are potentially infinite, while files usually are not (unless some device is mapped as if it were a file). The important difference for processing is that you cannot assume a stream can be read into memory all at once; you have to process it in chunks.
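
A minimal sketch of chunked processing, assuming a text stream. The function name `count_chars` and the default chunk size are arbitrary choices for illustration. Only one chunk is held in memory at a time.

```python
import io


def count_chars(stream, chunk_size=4096):
    """Count characters in a text stream without loading it all at once."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty string means the stream is exhausted.
            break
        total += len(chunk)
    return total


n_chars = count_chars(io.StringIO("hello world"), chunk_size=4)
```

For line-oriented text, iterating with `for line in stream:` gives the same bounded-memory behaviour, as long as individual lines are not huge.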

n611x007
pepr
  • I was looking for a reference implementation of this in the stdlib. Thanks for the snippet, it really saves time. I would give another +1 for the warning about chunks if I could. – n611x007 Apr 03 '13 at 13:06
4

There are numerous functions in the Python standard library that accept both: strings that are filenames, or open file objects (I assume that's what you're referring to as a "stream"). It's really not hard to create a decorator that makes your functions accept either one.
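
One possible shape for such a decorator, as a sketch only. The names `accepts_path_or_stream` and `count_lines` are made up for this example, and it assumes the file-like argument comes first and that files should be opened as UTF-8 text.

```python
import functools
import io


def accepts_path_or_stream(func):
    """Let func(stream, ...) also be called with a filename as first argument."""
    @functools.wraps(func)
    def wrapper(source, *args, **kwargs):
        if hasattr(source, "read"):
            # Already file-like: pass it through and leave it open.
            return func(source, *args, **kwargs)
        # Otherwise treat it as a path; the with-block closes what we opened.
        with open(source, "r", encoding="utf-8") as stream:
            return func(stream, *args, **kwargs)
    return wrapper


@accepts_path_or_stream
def count_lines(stream):
    return sum(1 for _ in stream)


n_lines = count_lines(io.StringIO("a\nb\n"))
```

The wrapped function is written once against a stream, and callers can pass either `count_lines("data.txt")` or `count_lines(open_file)`.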

One serious drawback to using "streams" is that you pass one to your function and then your function reads from it -- effectively changing its state. Depending on your program, recovering that state could be messy if it's necessary. (e.g. you might need to litter your code with f.tell() and then f.seek().)
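
For illustration, the f.tell() / f.seek() bookkeeping might look like this. `peek_line` is a hypothetical helper, and it assumes the stream is seekable (pipes and sockets are not).

```python
import io


def peek_line(stream):
    """Read one line, then put the stream back where it was."""
    position = stream.tell()
    try:
        return stream.readline()
    finally:
        # Restore the caller's position even if reading raised.
        stream.seek(position)


f = io.StringIO("first\nsecond\n")
peeked = peek_line(f)
next_line = f.readline()
```

Because the position is restored, the `readline()` call after `peek_line` sees the same first line again.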

mgilson
  • Yes, when I said "stream", I meant "open file object". Wouldn't it be possible to write a decorator that saves and restores stream state? – max Sep 25 '12 at 05:43
  • And isn't there a way to create an inexpensive copy of a stream, such that the copy owns its own "pointer", while the "pointer" of the original stream is left untouched? That would be even cleaner than save/restore state approach. – max Sep 25 '12 at 05:44
  • @max -- Sure, you could write a decorator to do that. The important thing is to document when you're restoring the state and when you're not. As far as creating a copy, the only thing I can think of is `itertools.tee`, which is a little bit different (but it is way past my normal bedtime, so I don't guarantee anything that I type right now :^) . – mgilson Sep 25 '12 at 05:45
  • So file name vs file object feels a bit like iterable vs iterator. – max Sep 25 '12 at 05:51
  • @max -- I suppose it is similar. – mgilson Sep 25 '12 at 05:53
  • Actually, can you give an example of a library function that does this? – max Sep 25 '12 at 06:03
  • I can't speak for others, but I usually *WANT* the function to change the state of the stream. E.g. I want my (hypothetical) "parse_header" function to leave the file pointer at the end of the header, so that the following "read_item" can then start reading from the appropriate point in the file. – janneb Sep 25 '12 at 06:38
  • @janneb -- I do too. My point is that you need to be careful to keep track of where the file pointer is. – mgilson Sep 25 '12 at 06:42
  • The `xml.etree.ElementTree.parse()` function also accepts either a filename or an open file. The problem with users is that you never know which one they prefer. Sometimes it is simply handy to pass just the filename. *Readability counts*. Simpler code is easier to read. – pepr Sep 25 '12 at 21:45
  • @mgilson not sure if I've got you right. Actually, [csv.reader](https://docs.python.org/2/library/csv.html#csv.reader) ***doesn't seem to be meant to accept filenames***. It accepts any iterable that returns strings, which will go horribly wrong on a filename. As for 'streams', it definitely accepts not just file-like objects but any other suitable iterable, if that's what you meant. – n611x007 May 09 '14 at 13:03
  • @naxa -- Yeah, I'm not sure why I commented about `csv.reader`. It clearly doesn't have that behavior. `xml.etree.ElementTree.parse` is a correct example. – mgilson May 09 '14 at 16:17