
The problem I'm having is that I'm iterating over a pretty large CSV file. startDate and endDate are inputs given by the user, and I only need to search within that range.

However, when I run the program up to that point, it takes a long time just to spit "set()" back out at me. I've marked where I'm having trouble in the code below.

Looking for suggestions and possibly sample code. Thank you all in advance!

import csv

def compare(word1, word2, startDate, endDate):
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        year = set()
        for row in readWords:
            if row[1] in range(int(startDate), int(endDate)): #< Having trouble here
                if row[0] == word1:
                    year.add(row[1])
        print(year)

2 Answers


The reason your test isn't finding any years is that the expression:

row[1] in range(int(startDate), int(endDate))

is checking whether a string value appears in a range of integers. If you test:

"1970" in range(1960, 1980)

you will see that it returns False. You need to write:

int(row[1]) in range(int(startDate), int(endDate))
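
For example, a quick interpreter check shows the difference once the value is converted:

>>> "1970" in range(1960, 1980)
False
>>> 1970 in range(1960, 1980)
True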

However, this is still quite inefficient. It checks whether the value int(row[1]) occurs anywhere in the sequence [int(startDate), int(startDate)+1, ..., int(endDate)-1], and it does so by linear search. Much faster is:

if int(startDate) <= int(row[1]) < int(endDate):

Note that your original code excludes endDate from the range of possible dates (because range excludes its second argument), and the comparison above does the same.

Edit: Actually, I should point out that it's only Python 2 where an expression like 500000 in range(1, 1000000) is slow, because range builds the full list and the membership test scans it linearly. In Python 3, range supports constant-time membership tests for integers, so it's fast (Python 2's xrange avoids building the list, but its membership test still iterates).
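
Putting it together, a minimal sketch of the corrected function, with startDate and endDate converted once outside the loop instead of on every row:

import csv

def compare(word1, word2, startDate, endDate):
    start, end = int(startDate), int(endDate)  # convert the bounds once
    years = set()
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        for row in readWords:
            # row[0] is the word, row[1] is the year, as in the question
            if row[0] == word1 and start <= int(row[1]) < end:
                years.add(row[1])
    print(years)

Checking row[0] == word1 first also skips the int(row[1]) conversion for rows that can't match anyway.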

K. A. Buhr

You can try the read_csv function from the pandas library. It lets you read the file in chunks of a chosen size, so you can get around the size problem.

import pandas as pd

reader = pd.read_csv(file_name, chunksize=chunk_size, iterator=True)

while True:
    try:
        df = reader.get_chunk(chunk_size)
        # select the rows of this chunk whose dates fall in the desired range
    except StopIteration:
        break
    del df  # free the chunk before reading the next one
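
For example, a minimal sketch of the filtering step, assuming (as in the question's code) that the file has no header row, with the word in column 0 and the year in column 1:

import pandas as pd

def compare(word1, word2, startDate, endDate):
    start, end = int(startDate), int(endDate)
    years = set()
    # header=None because the file has no header row; chunksize keeps memory bounded
    for chunk in pd.read_csv('all_words.csv', header=None, chunksize=100000):
        year_col = chunk[1].astype(int)
        mask = (chunk[0] == word1) & (year_col >= start) & (year_col < end)
        years.update(year_col[mask])
    print(years)

Iterating directly over the reader returned by read_csv(..., chunksize=...) is equivalent to the get_chunk() loop above, but it handles the end of the file for you.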
amin