Read from file one element at a time Python

Question

I have a file that is not structured on a line-by-line basis, but rather in groups of different sizes that wrap to the next line. I won't go into more detail since it doesn't really matter. Suffice to say lines don't mean anything structurally.

My question is this: is there a way to read from a file element-by-element, rather than line-by-line? I'm pretty sure it's unpythonic to not do line-by-line, but I'd rather not have to read each line and concatenate it with the previous line and then process that. If there's a simple way to read each element at a time it would make things a lot easier. Sorry if this has been asked before, I really couldn't find anything. Thanks!

EDIT: I'll add a simple example

file looks like this:

1.00 3 4.3 5.6 2.3 4 12.4 0.5 10.2 1.10 8
5.9 11.2 7.3 1.20 8 0.2 1.2 4.2 11 23.1 4.0
7.3 13 4.4 1.7 0.5 (etc.)

The groups start with 1.00, 1.10, 1.20 (always increase by 0.1)

Possible duplicate of [Reading in file block by block using specified delimiter in python](http://stackoverflow.com/questions/38655176/reading-in-file-block-by-block-using-specified-delimiter-in-python) — Chris_Rands, Mar 21 '17 at 15:53
Do you know anything at all about the structure of the file? — Bill Bell, Mar 21 '17 at 15:53
The separator is a timetag, so I need to know the value of the next element to see if it's 0.1 seconds greater than the previous element, then I will know it's the next group. — Arthur Dent, Mar 21 '17 at 15:54
Also see: http://stackoverflow.com/questions/16260061/reading-a-file-with-a-specified-delimiter-for-newline — Chris_Rands, Mar 21 '17 at 15:56
@Chris_Rands my groups aren't separated as cleanly as in that example. I'll look more closely but I don't think it is the same case. — Arthur Dent, Mar 21 '17 at 15:57
@ArthurDent Your delimiter is a space (`' '`) according to your example and you can build a generator as in the linked duplicates — Chris_Rands, Mar 21 '17 at 15:58
@Chris_Rands Does it matter if there is an uneven number of spaces between the elements? It shouldn't, right? @Harvey I edited my OP. — Arthur Dent, Mar 21 '17 at 16:01
Can you guarantee that no data will ever occur between timestamps that equals the timestamp of the next group? So... `[1.0 3 4 1.1] [1.1 23 3.2]...` (brackets added for readability). — Harvey, Mar 21 '17 at 16:01
@ArthurDent If you split on whitespace you can ignore that. Anyway, I suggest you read the linked duplicates, then try to implement this yourself. Then come back with a new question to Stack Overflow if your code needs debugging, good luck! — Chris_Rands, Mar 21 '17 at 16:02
@Harvey I can't guarantee it won't have the same value, but it won't have the same precision. Is it possible to evaluate elements based on their precision? — Arthur Dent, Mar 21 '17 at 16:03
If lines don't matter, should numbers spanning a line be concatenated? Look at the end of the first line and the beginning of the second in your example. Should that be two numbers, 8 and 5.9, or one number, 85.9? — tdelaney, Mar 21 '17 at 16:35
The numbers do get split by the line. They can even be "3." on one line and "0213" on another, representing the number 3.0213. — Arthur Dent, Mar 21 '17 at 18:15
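
On the precision question raised in the comments above: one possible approach, sketched here under the assumption that timetags are always written with exactly two decimal places (as in 1.00, 1.10, 1.20), is to parse each token with `decimal.Decimal`, which keeps trailing zeros. The helper name `is_timetag` is hypothetical and not part of any answer.

```python
from decimal import Decimal


def is_timetag(token, expected):
    """Return True if token equals the expected timetag AND is written
    with exactly two decimal places (assumed timetag format)."""
    value = Decimal(token)
    # Decimal keeps the written precision: '1.10' has exponent -2,
    # while '1.1' has exponent -1, even though the two compare equal.
    return value == expected and value.as_tuple().exponent == -2


expected = Decimal('1.10')
print(is_timetag('1.10', expected))  # a timetag
print(is_timetag('1.1', expected))   # same value, different precision
```

With this check, a data value such as `1.1` is not mistaken for the timetag `1.10`, as long as the precision really does differ as described.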

Bill Bell · Answer 1 · 2017-03-21T21:15:07.590

If the numbers don't span record breaks then I think that this can be done more simply. This is your data.

1.00 3 4.3 5.6 2.3 4 12.4 0.5 10.2 1.10 8
5.9 11.2 7.3 1.20 8 0.2 1.2 4.2 11 23.1 4.0
7.3 13 4.4 1.7 0.5

Here's the code.

from decimal import Decimal

def records(currentTime=Decimal('1.00')):
    first = True
    with open('sample.txt') as sample:
        for line in sample:
            for number in line.split():
                if Decimal(number) == currentTime:
                    if first:
                        first = False
                    else:
                        yield record
                    record = [number]
                    currentTime += Decimal('0.1')
                else:
                    record.append(number)
    yield record

for record in records():
    print (record)

Here's the output.

['1.00', '3', '4.3', '5.6', '2.3', '4', '12.4', '0.5', '10.2']
['1.10', '8', '5.9', '11.2', '7.3']
['1.20', '8', '0.2', '1.2', '4.2', '11', '23.1', '4.0', '7.3', '13', '4.4', '1.7', '0.5']

EDIT: This version works along the same lines but does not assume that numbers cannot span record breaks. It uses stream I/O, reading the input in fixed-size chunks. The main things you would change are the chunk size and, of course, the source.

from decimal import Decimal
from io import StringIO
sample = StringIO('''1.00 3 4.3 5.6 2.3 4 12.4 0.5 10.2 1.10 8 \n5.9 11.2 7.3 1.20 8\n.15 0.2 1.2 4.2 11 23.1 4.0 \n7.3 13 4.4 1.7 0.5''')

def records(currentTime=Decimal('1.00')):
    first = True
    previousChunk = ''
    exhaustedInput = False
    while True:
        chunk = sample.read(50)
        if not chunk:
            exhaustedInput = True
            chunk = previousChunk
        else:
            chunk = (previousChunk + chunk).replace('\n', '')
        items = chunk.split()
        # The last item may continue in the next chunk, so hold it back
        # unless the input is exhausted or the chunk ends in whitespace.
        if exhaustedInput or not items or chunk[-1].isspace():
            previousChunk = ''
        else:
            previousChunk = items.pop()
        for number in items:
            if Decimal(number) == currentTime:
                if first:
                    first = False
                else:
                    yield record
                record = [number]
                currentTime += Decimal('0.1')
            else:
                record.append(number)
        if exhaustedInput:
            yield record
            break

for record in records():
    print (record)

Here is the output.

['1.00', '3', '4.3', '5.6', '2.3', '4', '12.4', '0.5', '10.2']
['1.10', '8', '5.9', '11.2', '7.3']
['1.20', '8.15', '0.2', '1.2', '4.2', '11', '23.1', '4.0', '7.3', '13', '4.4', '1.7', '0.5']
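
Since the source and the chunk size are the parts most likely to change, a minimal sketch that takes both as parameters might look like the following. The parameter names `stream`, `chunk_size`, `start` and `step` are my own, and tokens seen before the first timetag are silently skipped, which is an assumption rather than something this answer specifies.

```python
from decimal import Decimal
from io import StringIO

SAMPLE = ('1.00 3 4.3 5.6 2.3 4 12.4 0.5 10.2 1.10 8 \n'
          '5.9 11.2 7.3 1.20 8\n'
          '.15 0.2 1.2 4.2 11 23.1 4.0 \n'
          '7.3 13 4.4 1.7 0.5')


def records(stream, chunk_size=4096, start=Decimal('1.00'), step=Decimal('0.1')):
    """Yield one list of number strings per timetag group."""
    expected = start
    record = None   # no group has been opened yet
    tail = ''       # possibly incomplete token carried between chunks
    while True:
        chunk = stream.read(chunk_size)
        # Newlines are dropped so a number split by a line break is rejoined.
        text = (tail + chunk).replace('\n', '')
        tokens = text.split()
        # Hold back the last token unless the input is exhausted or the
        # text ends in whitespace (the token is then known to be complete).
        if chunk and tokens and not text[-1].isspace():
            tail = tokens.pop()
        else:
            tail = ''
        for token in tokens:
            if Decimal(token) == expected:
                if record is not None:
                    yield record
                record = [token]
                expected += step
            elif record is not None:
                record.append(token)
        if not chunk:
            break
    if record is not None:
        yield record


for record in records(StringIO(SAMPLE), chunk_size=50):
    print(record)
```

To read a real file, pass the object returned by `open(...)` in place of the `StringIO`.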

Unfortunately the numbers do span record breaks. — Arthur Dent, Mar 21 '17 at 18:15
@ArthurDent: Modified to allow for spanning. — Bill Bell, Mar 22 '17 at 03:03

score 1 · Accepted Answer · edited May 23 '17 at 10:30

A generator solution using a custom header method. Loosely based on https://stackoverflow.com/a/16260159/47078.

Input:

' 1.00 3 4.3 5.6\n 2.3\n 4 12.4 0.5 10.2 1.10 8 5.9 11.2\n 7.3 1.20 8 0.2 1.2\n 4.2 11 23.1 4.0\n 7.3\n 13 4.4 1.7 0.5'

Output:

['1.00', '3', '4.3', '5.6', '2.3', '4', '12.4', '0.5', '10.2']
['1.10', '8', '5.9', '11.2', '7.3']
['1.20', '8', '0.2', '1.2', '4.2', '11', '23.1', '4.0', '7.3', '13', '4.4', '1.7', '0.5']

Source:

#!/usr/bin/env python3

from contextlib import suppress
from functools import partial

# yields strings from a file based on custom headers
#
# f                      a file like object supporting read(size)
# index_of_next_header   a function taking a string and returning
#                        the position of the next header or raising
#                        (default = group by newline)
# chunk_size             how many characters to read at a time
#                        (bytes, if f was opened in binary mode)
def group_file_by_custom_header(f,
                                index_of_next_header=lambda buf: buf.index('\n') + 1,
                                chunk_size=10):
    buf = ''
    for chunk in iter(partial(f.read, chunk_size), ''):
        buf += chunk
        with suppress(ValueError):
            while True:
                pos = index_of_next_header(buf)
                yield buf[:pos]
                buf = buf[pos:]
    if buf:
        yield buf


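# data holds the next expected timestamp (as a string) between calls.
# Pass an empty list to data; it is filled in on the first call.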
def index_of_next_timestamp(buf, data):
    def next_timestamp(buf):
        next_ts = buf.strip().split(maxsplit=2)
        if len(next_ts) < 2:
            raise ValueError()
        return '{:4.2f}'.format(float(next_ts[0]) + 0.1)

    if not data:
        data.append(next_timestamp(buf))
    pos = buf.index(data[0])
    data[0] = next_timestamp(buf[pos:])
    return pos

def get_dummy_file():
    import io
    data = ' 1.00 3 4.3 5.6\n 2.3\n 4 12.4 0.5 10.2 1.10 8 5.9 11.2\n 7.3 1.20 8 0.2 1.2\n 4.2 11 23.1 4.0\n 7.3\n 13 4.4 1.7 0.5'
    return io.StringIO(data)

data_file = get_dummy_file()

header_fn = partial(index_of_next_timestamp, data=[])
for group in group_file_by_custom_header(data_file, header_fn):
    print(repr(group.split()))

Thank you so much! The method looks good, although I'm using python 2.7 and contextlib doesn't have suppress. Do you know what I can use instead? — Arthur Dent, Mar 21 '17 at 18:26
I just replaced suppress() with try/except, and split(maxsplit=2) with split(None, 2), and it worked! Thank you again! — Arthur Dent, Mar 21 '17 at 18:43
Make sure to use a much bigger value like 4096 for `chunk_size`. I only used 10 so it would be trivial to see the algorithm in action. — Harvey, Apr 09 '17 at 00:13
It's a common OS and disk read block size. (I think). Just pick a nice multiple of two. — Harvey, Apr 10 '17 at 14:37
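
The two substitutions reported in the comments above (try/except in place of `suppress()`, and `split(None, 2)` in place of `split(maxsplit=2)`) can be sketched as follows. This assumes those are the only changes needed, as the comment reports. The default `chunk_size` of 4096 follows the suggestion above, and making `index_of_next_header` a required argument is my own simplification.

```python
from functools import partial
import io


def group_file_by_custom_header(f, index_of_next_header, chunk_size=4096):
    buf = u''
    for chunk in iter(partial(f.read, chunk_size), u''):
        buf += chunk
        try:  # stands in for `with suppress(ValueError):`
            while True:
                pos = index_of_next_header(buf)
                yield buf[:pos]
                buf = buf[pos:]
        except ValueError:
            pass  # no further header in the buffer yet, so read more
    if buf:
        yield buf


def index_of_next_timestamp(buf, data):
    def next_timestamp(buf):
        next_ts = buf.strip().split(None, 2)  # stands in for split(maxsplit=2)
        if len(next_ts) < 2:
            raise ValueError()
        return '{:4.2f}'.format(float(next_ts[0]) + 0.1)

    if not data:
        data.append(next_timestamp(buf))
    pos = buf.index(data[0])
    data[0] = next_timestamp(buf[pos:])
    return pos


text = (u' 1.00 3 4.3 5.6\n 2.3\n 4 12.4 0.5 10.2 1.10 8 5.9 11.2\n'
        u' 7.3 1.20 8 0.2 1.2\n 4.2 11 23.1 4.0\n 7.3\n 13 4.4 1.7 0.5')
header_fn = partial(index_of_next_timestamp, data=[])
groups = [g.split() for g in group_file_by_custom_header(io.StringIO(text), header_fn)]
for group in groups:
    print(group)
```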

score 1 · Answer 3 · answered Mar 22 '17 at 17:48

I don't know why this didn't occur to me before. You can read more-or-less element by element using a lexical scanner. I've used the one that comes with Python, namely shlex. It has the virtue that it will operate on a stream input, unlike some of the more popular ones, I understand. This seems even simpler.

from io import StringIO
sample = StringIO('''1.00 3 4.3 5.6 2.3 4 12.4 0.5 10.2 1.10 8 \n5.9 11.2 7.3 1.20 8\n.15 0.2 1.2 4.2 11 23.1 4.0 \n7.3 13 4.4 1.7 0.5''')

from shlex import shlex
lexer = shlex(instream=sample, posix=False)
lexer.wordchars = '0123456789.\n'  # digits, the decimal point, and newline (numbers may wrap)
lexer.whitespace = ' '
lexer.whitespace_split = True

from decimal import Decimal

def records(currentTime=Decimal('1.00')):
    first = True
    while True:
        token = lexer.get_token()
        if token:
            token = token.strip()
            if not token:
                break
        else:
            break
        token = token.replace('\n', '')
        if Decimal(token) == currentTime:
            if first:
                first = False
            else:
                yield record
            currentTime += Decimal('0.1')
            record = [float(token)]
        else:
            record.append(float(token))
    yield record

for record in records():
    print (record)

Output is:

[1.0, 3.0, 4.3, 5.6, 2.3, 4.0, 12.4, 0.5, 10.2]
[1.1, 8.0, 5.9, 11.2, 7.3]
[1.2, 8.15, 0.2, 1.2, 4.2, 11.0, 23.1, 4.0, 7.3, 13.0, 4.4, 1.7, 0.5]

score 0 · Answer 4 · answered Mar 21 '17 at 16:32

If it were me, I'd write generator-function wrappers to provide precisely the level of detail required:

def by_spaces(fp):
    for line in fp:
        for word in line.split():
            yield word

def by_numbers(fp):
    for word in by_spaces(fp):
        yield float(word)

def by_elements(fp):
    fp = by_numbers(fp)
    start = next(fp)
    result = [start]
    for number in fp:
        if abs(start+.1-number) > 1e-6:
            result += [number]
        else:
            yield result
            result = [number]
            start = number
    if result:
        yield result

with open('x.in') as fp:
    for element in by_elements(fp):
        print (element)
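
The comments on the question say a number can be split by a line break (`3.` on one line, `0213` on the next), in which case `by_spaces` above would yield the two halves as separate words. Below is a minimal sketch of a drop-in replacement. It assumes that a line ending without trailing whitespace always continues on the next line, which is the convention in the sample data of the other answers. The name `by_joined_words` is hypothetical.

```python
from io import StringIO


def by_joined_words(fp):
    """Yield whitespace-separated words, rejoining a word that was
    split by a line break (assumed: no whitespace before the break)."""
    carry = ''
    for line in fp:
        text = carry + line.rstrip('\r\n')
        words = text.split()
        # If the line does not end in whitespace, its last word may
        # continue on the next line, so keep it for the next round.
        if words and not text[-1].isspace():
            carry = words.pop()
        else:
            carry = ''
        for word in words:
            yield word
    if carry:
        yield carry


demo = StringIO('1.20 8\n.15 0.2 \n7.3 3.\n0213')
print(list(by_joined_words(demo)))
```

Passing `by_joined_words(fp)` wherever `by_spaces(fp)` is used leaves the other generators unchanged.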
