0

I am processing a huge postgresql database for which I have created a "fetch" function.

def fetch(cursor, batch_size=1e3):
    """An iterator that uses fetchmany to keep memory usage down"""
    while True:
        records = cursor.fetchmany(int(batch_size))
        if not records:
            break
        for record in records:
            yield record

For each item I am doing some processing, but right now I have a problem where the last item in some cases will be omitted as I am doing some comparison between the items. And as soon as that comparison doesn't yield on the last item nothing will be done.

connection = psycopg2.connect(<url>)
cursor = connection.cursor()

cursor.execute(<some query>)

temp_today = 0

for row in fetch(cursor):
    item = extract_variables(row)
    date = item['datetime']
    today = date.date()
    if temp_today is 0:
        # do something with first row
        temp_today = date
    # -----------------------------------------
    # I feel like I am missing a statement here
    # something like:
    # if row == rows[-1]:
    #     do something with last row..
    # -----------------------------------------
    elif temp_today.date() == today:
        # do something with every row where 
        # the date is the same
    else:
        # do something with every row where
        # the dates ain't the same

How do I do something with the last item when I am using a yield?

It is extremely important for me to use yield as I am handling a VERY huge dataset and I will run out of memory if I don't.

Zeliax
  • 4,987
  • 10
  • 51
  • 79
  • 1
    It should be possible to get the number of rows in result set from the cursor, right? Then you could just compare a counter (enumerate) with that number. – Peter Smit Feb 15 '17 at 11:43
  • 1
    `... as I am doing some comparison between the items` You could do this in the database (by using window functions, or by some self-join) – wildplasser Feb 15 '17 at 11:47

2 Answers2

0

You can define another generator so you can iterate over the item returned and the previous one (if any):

def pair( sequence):
    previous = None
    for item in sequence:
        yield (item, previous)
        previous = item

for item, previous_item in pair( mygenerator( args))
    if previous_item is None:
        # process item: first one returned
    else:
        # you can compare item and previous_item
nigel222
  • 7,582
  • 1
  • 14
  • 22
0

Thanks to @Peter Smit from the comments I used the following solution:

connection = psycopg2.connect(<url>)
cursor = connection.cursor()

cursor.execute(<some query>)

temp_today = 0
parsed_count = 0
cursor_count = cursor.rowcount

for row in fetch(cursor):
    item = extract_variables(row)
    date = item['datetime']
    today = date.date()
    if temp_today is 0:
        # do something with first row
        temp_today = date
    elif parsed_count == cursor_count:
        # do something with the last row
    elif temp_today.date() == today:
        # do something with every row where 
        # the date is the same
    else:
        # do something with every row where
        # the dates ain't the same
Zeliax
  • 4,987
  • 10
  • 51
  • 79