Use only a certain portion of file in every iteration

Question

I am using an external API for Python (specifically 3.x) to get search results based on certain keywords located in a .txt file. However, due to a constraint as to how many keywords I can search for in every time interval (assume I need an hourly wait) I run the script, I can only use a portion of the keywords (say 50 keywords). How can I, Pythonically, use only a portion of the keywords in every iteration?

Let's assume I have the following list of keywords in the .txt file myWords.txt:

Lorem #0
ipsum #1
dolor #2
sit   #3
amet  #4
...
vitae #167

I want to use on the keywords found in 0-49 (i.e. the first 50 lines) on the first iteration, 50-99 on the second, 100-149 on the third, and 150-167 on the fourth and last iteration.

This is, of course, possible by reading the whole file, read an iteration counter saved elsewhere, and then choose the keyword range residing in that iterable part of the complete list. However, in what I'd like to do, I do not want to have an external counter, but rather only have my Python script and the myWords.txt where the counter is dealt with in the Python code itself.

I want to take only the keywords that I should be taking in the current run of the script (depending on the (total number of keywords)/50). At the same time, if I were to add any new keywords at the end of the myWords.txt, it should adjust the iterations accordingly, and if needed, add new iterations.

This is occurring in separate runs, correct? If so, you're going to need to store your progress in some way, the script can't just magically remember where it left off after it has exited. A separate file with the results of `fileobj.tell()` or the like would be simple, but feel fairly kludgy. Is there a reason you're waiting an *hour* between checks, and only processing 50 words, and a reason why the script can't just `sleep` if it absolutely must do so? — ShadowRanger, Jan 29 '19 at 12:52
Yes, it will be in seperate runs. I was simply wondering if I can somehow use something that isn't so kludgy, hence an alternative to store the progress externally. The reason why sleep isn't much of an alternative is due to processing of the results later on in the script, which are in turn used by another script. @ShadowRanger — KaanTheGuru, Jan 29 '19 at 12:57

score 2 · Accepted Answer · answered Jan 29 '19 at 13:43

As far as I know there is no way to persist the keywords used between different invocations of your script. However, you do have a couple of choices in how you implement a "persistent storage" of the information that you need in different invocations of the script.

Instead of just having a single input file named myWords.txt, you could have two files. One file containing keywords that you want to search for and one file containing keywords that you've already searched for. As you search for keywords you remove them from the one file and place them in the other.
You can implement a persistent storage strategy, that stores the words.
(The easiest thing and what I would do) is just have a file named next_index.txt and store the last index from your iteration.

Here is an implementation of what I would do:

Create a next position file

echo 0 > next_pos.txt

Now do your work

with open('next_pos.txt') as fh:
    next_pos = int(fh.read().strip())

rows_to_search = 2 # This would be 50 in your case
keywords = list()
with open('myWords.txt') as fh:
    fh.seek(next_pos)
    for _ in range(rows_to_search):
       keyword = fh.readline().strip()
       keywords.append(keyword)
       next_pos = fh.tell()

# Store cursor location in file.
with open('next_pos.txt', 'w') as fh:
    fh.write(str(next_pos))

# Make your API call
# Rinse, Wash, Repeat

As I've stated you have lots of options, and I don't know if any one way is more Pythonic than any other, but whatever you do try and keep it simple.

score 0 · Answer 2 · answered Jan 29 '19 at 13:18

Try this. Modify for your needs.

$ cat foo
1
2
3
4
5
6
7
8
9
10

cat getlines.py
import sys


def getlines(filename, limit):
    with open(filename, 'r') as handle:
        keys = []
        for idx, line in enumerate(handle):
            if idx % limit == 0 and idx != 0:
                yield keys
                keys = []
            keys.append(line.strip())

print(list(getlines('foo', 2)))
print(list(getlines('foo', 3)))
print(list(getlines('foo', 4)))

Use only a certain portion of file in every iteration

2 Answers2