
The problem I'm having is that I'm iterating over a pretty large CSV file. startDate and endDate are inputs given by the user, and I only need to search within that range.

However, when I run the program up to that point, it takes a long time just to spit "set()" back out at me. I've marked where I'm having trouble in the code below.

Looking for suggestions and possibly sample code. Thank you all in advance!

import csv

def compare(word1, word2, startDate, endDate):
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        year = set()
        for row in readWords:
            if row[1] in range(int(startDate), int(endDate)): #< Having trouble here
                if row[0] == word1:
                    year.add(row[1])
        print(year)

2 Answers


The reason your test isn't finding any years is that the expression:

row[1] in range(int(startDate), int(endDate))

is checking whether a string value appears in a range of integers. If you test:

"1970" in range(1960, 1980)

you will see that it returns False. You need to write:

int(row[1]) in range(int(startDate), int(endDate))
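
For example, a quick interpreter check shows the difference once the value is converted:

>>> "1970" in range(1960, 1980)
False
>>> 1970 in range(1960, 1980)
True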

However, this is still quite inefficient. It checks whether the value int(row[1]) occurs anywhere in the sequence [int(startDate), int(startDate)+1, ..., int(endDate)-1], and it does so by linear search. Much faster is:

if int(startDate) <= int(row[1]) < int(endDate):

Note that your original code excludes endDate from the range of possible dates (because range excludes its second argument), and the comparison above does the same.

Edit: Actually, I should point out that it's only Python 2 where an expression like 500000 in range(1, 1000000) is slow, because range builds the full list and the membership test scans it linearly. In Python 3, range supports constant-time membership tests for integers, so it's fast (Python 2's xrange avoids building the list, but its membership test still iterates).
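
Putting it together, a minimal sketch of the corrected function, with startDate and endDate converted once outside the loop instead of on every row:

import csv

def compare(word1, word2, startDate, endDate):
    start, end = int(startDate), int(endDate)  # convert the bounds once
    years = set()
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        for row in readWords:
            # row[0] is the word, row[1] is the year, as in the question
            if row[0] == word1 and start <= int(row[1]) < end:
                years.add(row[1])
    print(years)

Checking row[0] == word1 first also skips the int(row[1]) conversion for rows that can't match anyway.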

K. A. Buhr

You can try the read_csv function from the pandas library. It lets you read the file in chunks of a chosen size, so you can get around the size problem.

import pandas as pd

reader = pd.read_csv(file_name, chunksize=chunk_size, iterator=True)

while True:
    try:
        df = reader.get_chunk(chunk_size)
        # select the rows of this chunk whose dates fall in the desired range
    except StopIteration:
        break
    del df  # free the chunk before reading the next one
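
For example, a minimal sketch of the filtering step, assuming (as in the question's code) that the file has no header row, with the word in column 0 and the year in column 1:

import pandas as pd

def compare(word1, word2, startDate, endDate):
    start, end = int(startDate), int(endDate)
    years = set()
    # header=None because the file has no header row; chunksize keeps memory bounded
    for chunk in pd.read_csv('all_words.csv', header=None, chunksize=100000):
        year_col = chunk[1].astype(int)
        mask = (chunk[0] == word1) & (year_col >= start) & (year_col < end)
        years.update(year_col[mask])
    print(years)

Iterating directly over the reader returned by read_csv(..., chunksize=...) is equivalent to the get_chunk() loop above, but it handles the end of the file for you.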
amin