
I have a document that looks a bit like this:

key1 value_1_1 value_1_2 value_1_3 etc
key2 value_2_1 value_2_2 value_2_3 etc
key3 value_3_1 value_3_2 value_3_3 etc
etc

Where each key is a string and each value is a float, all separated by spaces. Each line has hundreds of values associated with it, and there are hundreds of thousands of lines. Each line needs to be processed in a particular way, but because my program will only ever need the information from a small fraction of the lines, it seems like a giant waste of time to immediately process each line. Currently, I just have a list of each unprocessed line, and maintain a separate list containing each key. When I need to access a line I'll use the key list to find the index of the line I need, then process the line at that index in the lines list. My program may potentially call for looking up the same line multiple times, which would result in redundantly processing the same line over and over again, but still seems better than processing every single line right from the start.
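Here is roughly what I'm doing now, as a minimal sketch (the file contents and names here are made up; the real file has hundreds of thousands of lines with hundreds of values each):

```python
import io

# stand-in for the real data file
f = io.StringIO("key1 1.0 2.0 3.0\nkey2 4.0 5.0 6.0\n")

keys = []
lines = []
for line in f:
    keys.append(line.split(" ", 1)[0])
    lines.append(line)

def get_values(key):
    # keys.index() is a linear scan, and the same line gets
    # re-processed on every repeated lookup
    i = keys.index(key)
    return [float(v) for v in lines[i].split()[1:]]

print(get_values("key2"))  # [4.0, 5.0, 6.0]
```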

My question is, is there a more efficient way to do what I'm doing?

(and please let me know if I need to make any clarifications)

Thanks!

Mike S

2 Answers


First, I would store your lines in a dict. This will probably make lookups by key a lot faster. Building this dict can be as simple as `d = dict(line.split(' ', 1) for line in file_obj)`. If the keys have a fixed width, for example, you could speed this up even a bit more by just slicing the lines.
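For example (a minimal sketch; the 4-character key width in the slicing variant is a made-up assumption):

```python
import io

f = io.StringIO("key1 1.0 2.0\nkey2 3.0 4.0\n")

# one split per line; the values stay as unprocessed strings
d = dict(line.split(' ', 1) for line in f)
print(d['key1'])  # '1.0 2.0\n'

# if every key were exactly 4 characters wide, slicing would
# avoid the split entirely
f.seek(0)
d2 = {line[:4]: line[5:] for line in f}
print(d2['key2'])  # '3.0 4.0\n'
```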

Next, if the line processing is very computationally heavy, you could cache the results. I once worked this out by subclassing a dict:

class BufferedDict(dict):
    def __init__(self, file_obj):
        # map each key to its raw, unprocessed line remainder
        self.file_dict = dict(line.split(' ', 1) for line in file_obj)

    def __getitem__(self, key):
        # process the line on first access and cache the result in self
        if key not in self:
            self[key] = process_line(self.file_dict[key])
        return super(BufferedDict, self).__getitem__(key)

def process_line(line):
    """Your computationally heavy line processing function"""

This way, if you call my_buffered_dict[key], the line will be processed only if the processed version wasn't available yet.
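A quick usage sketch, with a made-up `process_line` that just parses the floats (the class is repeated here so the sketch is self-contained):

```python
import io

class BufferedDict(dict):
    def __init__(self, file_obj):
        self.file_dict = dict(line.split(' ', 1) for line in file_obj)

    def __getitem__(self, key):
        if key not in self:
            self[key] = process_line(self.file_dict[key])
        return super(BufferedDict, self).__getitem__(key)

def process_line(line):
    # stand-in for the heavy processing
    return [float(v) for v in line.split()]

d = BufferedDict(io.StringIO("a 1 2 3\nb 4 5 6\n"))
print(d['a'])  # processed on first access: [1.0, 2.0, 3.0]
print(d['a'])  # second access returns the cached list
```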

Sebastiaan
  • Your dict comprehension splits the line twice, once to get the key and once to lstrip the key from the line to get the value, and the OP is trying to avoid extra work. I think you could fix this with `self.file_dict = {parts[0]:parts[1:] for line in file_obj for parts in [line.split()]}`, but this is so ugly, I'd probably just use an explicit for-loop. Also, since your class extends dict, code might call `__setitem__`, which isn't really appropriate for this application. – PaulMcG May 06 '17 at 13:19
  • What is your point exactly regarding `__setitem__`? Thanks for your suggestion on the comprehension. You'll have to lose the square brackets around `line.split()` for your idea to work, and you will also have to join `parts[1:]` again to get the lines back as values. – Sebastiaan May 06 '17 at 13:46
  • 1
    If you want the remainder still joined as a single string, then change it to `self.file_dict = {parts[0]:parts[1] for line in file_obj for parts in [line.split(None, 1)]}`, so that you only do 1 split. Yes, you still need the square brackets. But for that matter, it is probably cleaner just using the dict constructor itself instead of contorting into a dict comprehension: `self.file_dict = dict(line.split(None, 1) for line in file_obj)`. – PaulMcG May 06 '17 at 14:05
  • Agreed, using the dict constructor with an iterable would be the most clean. I'll update my answer! I still don't get your point regarding `__setitem__` though. – Sebastiaan May 06 '17 at 15:12

Here is a class that scans the file and simply caches the file offsets. Lines are only processed when their keys are accessed. `__getitem__` caches the processed lines.

class DataFileDict:
    def __init__(self, datafile):
        self._index = {}
        self._file = datafile

        # build index of key -> file offset
        # (iter(readline, '') is used instead of iterating the file
        # directly, because tell() is disabled during direct iteration
        # of text-mode files in Python 3)
        loc = self._file.tell()
        for line in iter(self._file.readline, ''):
            key = line.split(None, 1)[0]
            self._index[key] = loc
            loc = self._file.tell()

    def __getitem__(self, key):
        retval = self._index[key]
        if isinstance(retval, int):
            self._file.seek(retval)
            line = self._file.readline()
            retval = self._index[key] = list(map(float, line.split()[1:]))
            print("read and return value for {} from file".format(key))
        else:
            print("returning cached value for {}".format(key))
        return retval

if __name__ == "__main__":
    from io import StringIO

    sample = StringIO("""\
A 1 2 3 4 5
B 6 7 8 9 10
C 5 6 7 8 1 2 3 4 5 6 7
""")

    reader = DataFileDict(sample)
    print(reader['A'])
    print(reader['B'])
    print(reader['A'])
    print(reader['C'])
    print(reader['D'])  # KeyError

prints

read and return value for A from file
[1.0, 2.0, 3.0, 4.0, 5.0]
read and return value for B from file
[6.0, 7.0, 8.0, 9.0, 10.0]
returning cached value for A
[1.0, 2.0, 3.0, 4.0, 5.0]
read and return value for C from file
[5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Traceback (most recent call last):
  File "C:/Users/ptmcg/.PyCharm2017.1/config/scratches/scratch.py", line 64, in <module>
    print(reader['D'])  # KeyError
  File "C:/Users/ptmcg/.PyCharm2017.1/config/scratches/scratch.py", line 28, in __getitem__
    retval = self._index[key]
KeyError: 'D'
PaulMcG