2

I am reading an xlsx file (using openpyxl) and a csv (using csv.reader). The openpyxl returns a generator properly, I can iterate over the values in the generator after it is returned from a function that differentiates whether the file is an excel file or a csv. The problem arises when I am doing the same thing with a csv file, you see it returns a generator but I can't iterate over it since the csv file appears to be closed after the function return within the with statement. I know it's obvious that file closes after the with statement has fullfilled it's purpose, but why then does the openpyxl work? why can I still can iterate over the generator of an excel file? and, my ultimate question, how can I make the csv.reader behave the way openpyxl behaves here, i.e. me being able to iterate over the generator values.

import csv
from openpyxl import load_workbook


def iter_rows(worksheet):
    """
    Iterate over Excel rows and return an iterator of the row value lists
    """
    for row in worksheet.iter_rows():
        yield [cell.value for cell in row]


def get_rows(filename, file_extension):
    """
    Based on file extension, read the appropriate file format
    """
    # read csv
    if file_extension == 'csv':
        with open(filename) as f:
            return csv.reader(f, delimiter=",")

    # read Excel files with openpyxl
    if file_extension in ['xls', 'xlsx']:
        wb2 = load_workbook(filename)
        worksheet1 = wb2[wb2.get_sheet_names()[0]]
        return iter_rows(worksheet1)


# this works properly
rows = get_rows('excels/ar.xlsx', 'xlsx')
print(rows)  # I am: <generator object iter_rows at 0x074D7A58>
print([row for row in rows])  # I am printing each row of the excel file from the generator

# Error: ValueError: I/O operation on closed file
rows = get_rows('excels/ar.csv', 'csv')
print(rows)  # I am: <generator object iter_rows at 0x074D7A58>
print([row for row in rows])  # ValueError: I/O operation on closed file
Ivan Bilan
  • 2,379
  • 5
  • 38
  • 58

2 Answers2

3

You don't use a with statement with openpxyl functions. But it seems you already know the problem, i.e. that you are trying to iterate over a file-handler after the with block has closed it. Iterate earlier? Or better yet, yield from the reader object:

def get_rows(filename, file_extension):
    """
    Based on file extension, read the appropriate file format
    """
    # read csv
    if file_extension == 'csv':
        with open(filename) as f:
            yield from csv.reader(f, delimiter=",")

    # read Excel files with openpyxl
    if file_extension in ['xls', 'xlsx']:
        wb2 = load_workbook(filename)
        worksheet1 = wb2[wb2.get_sheet_names()[0]]
        yield from iter_rows(worksheet1)

Or, if you are on Python 2:

def get_rows(filename, file_extension):
    """
    Based on file extension, read the appropriate file format
    """
    # read csv
    if file_extension == 'csv':
        with open(filename) as f:
            for row in csv.reader(f, delimiter=",")
                yield row

    # read Excel files with openpyxl
    if file_extension in ['xls', 'xlsx']:
        wb2 = load_workbook(filename)
        worksheet1 = wb2[wb2.get_sheet_names()[0]]
        for row in iter_rows(worksheet1):
            yield row

Also note 2 things:

  1. The addition of the yield from/yield makes the get_rows function a generator function, changing the semantics of the return iter_rows(worksheet1) line. You now want to yield from both branches.

  2. the way you originally wrote get_rows does not return a generator when you have a "csv". A csv.reader object is not a generator, (neither is, I believe a worksheet.iter_rows object, but I don't know because I don't use openpyxl). The reason your "xlsx" branch returns a generator is because you explicitly return the call to iter_rows, which you've defined as a generator. Your "csv" branch returns a csv.reader object. The latter is a lazy iterable, but it is not a generator. The former is a generator. Not all iterables are generators, but generators were added as a language construct to facilitate the writing of iterables, but now have expanded to be able to do all sorts of fancy stuff, like coroutines. See this answer to a famous question, which I think is better than the accepted answer.

juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172
  • `yield from return`. Interesting -- I've never seen this before! Is that valid or just a typo? – erip Jun 25 '17 at 21:52
  • You don't want to mix a `return ...` with a `yield` in a generator function. – Moses Koledoye Jun 25 '17 at 21:53
  • @erip note, I've added an example of equivalent Python 2 syntax. Also, since the addition of the `yield from` made the function a generater function, that changes the semantics of the `return iter_rows(worksheet1)` line. I've fixed it up for both situations. – juanpa.arrivillaga Jun 25 '17 at 22:00
  • Were you going to fix that 'yield from return csv.reader' so it's just `yield from csv.reader`? – OldGeeksGuide Jun 25 '17 at 22:13
  • @OldGeeksGuide ugh, yes, thank you for the correction. – juanpa.arrivillaga Jun 25 '17 at 22:14
  • @juanpa.arrivillaga I guess OldGeeksGuide asked my question more clearly. :) – erip Jun 25 '17 at 22:15
  • @erip yes, sorry, my post was replete with typos. Hopefully I don't lead you astray. But I think now it is at least coherent. – juanpa.arrivillaga Jun 25 '17 at 22:16
  • @Blckknght yes. But there is a subtle distinction between a function that returns a generator object, and a generator function. – juanpa.arrivillaga Jun 25 '17 at 22:19
  • @juanpa.arrivillaga delegation to a subgenerator, this is ingenious! Thanks a lot! – Ivan Bilan Jun 25 '17 at 22:36
  • @ivan_bilan yes, if you are writing generators that function as simple iterables, it is essentially a convenient way not to have to write out a for-loop if you want to yield from another iterable within a generator. If it is a coroutine, then it gets more fancy. – juanpa.arrivillaga Jun 25 '17 at 22:44
2

The issue has to do with how you're handling the file. Both of the branches in your function return an iterator, but the CSV branch uses a with statement which closes the file automatically when you return. That means the iterator you get csv.reader is useless, since the file it tries to read from is already closed by the time your top level code can use it.

One way to work around this would be to make your get_rows function a generator. If you yield from the csv.reader instead of returning it, the file will not be closed until it has been fully read (or the generator is discarded). You'll also need to yield from the iter_rows generator you wrote for the other branch.

Blckknght
  • 100,903
  • 11
  • 120
  • 169