What's the name of this programming pattern?

Question

Often one needs to process a sequence of "chunks", which are read from a stream of "atoms", where each chunk consisting of a variable number of atoms, and there is no way for the program to know that it has received a complete chunk until it reads the first atom of the next chunk (or the stream of atoms becomes exhausted).

A straightforward algorithm for doing this task would look like this:

LOOP FOREVER:
    SET x TO NEXT_ATOM
    IF DONE(x) OR START_OF_CHUNK(x):
        IF NOT EMPTY(accum):
            PROCESS(accum)
        END
        if DONE(x):
            BREAK
        END
        RESET(accum)
    END
    ADD x TO accum
END

So, my question is this:

Is there a name for this general class of problems and/or for the programming pattern shown above?

The remainder of this post are just a couple of (reasonably realistic) examples of what's described abstractly above. (The examples are in Python, although they could be translated easily to any imperative language.)

The first one is a function to produce a run-length encoding of an input string. In this case, the "atoms" are individual characters, and the "chunks" are maximal runs of the same character. Therefore, the program does not know that it has reached the end of a run until it reads the first character in the following run.

def rle(s):
    '''Compute the run-length encoding of s.'''

    n = len(s)
    ret = []
    accum = 0
    v = object()  # unique sentinel; ensures first test against x succeeds
    i = 0
    while True:
        x = s[i] if i < n else None
        i += 1
        if x is None or x != v:
            if accum > 0:
                ret.append((accum, v))
            if x is None:
                break
            accum = 0
        v = x
        accum += 1

    return ret

The second example is a function that takes as argument a read handle to a FASTA-formatted file, and parses its contents. In this case, the atoms are lines of text. Each chunk consists of a specially-marked first line, called the "defline" (and distinguished by a '>' as its first character), followed by a variable number of lines containing stretches of nucleotide or protein sequence. Again, the code can detect the end of a chunk unambiguously only after reading the first atom (i.e. the defline) of the next chunk.

def read_fasta(fh):
    '''Read the contents of a FASTA-formatted file.'''

    ret = []
    accum = []
    while True:
        x = fh.readline()
        if x == '' or x.startswith('>'):
            if accum:
                ret.append((accum[0], ''.join(accum[1:])))
            if x == '':
                break
            accum = []
        accum.append(x.strip())

    return ret

Your question is too specific I think for a design pattern. They tend to be more general. For example, you may use some callback to process the loop, using a Strategy pattern, so the same basic code could be used in various instances just by changing the callback function. — James Black, Sep 07 '11 at 00:22
That's why I did not call it a "design pattern". I used "programming pattern" to avoid that confusion. — kjo, Sep 07 '11 at 01:32

score 2 · Answer 1 · answered Sep 07 '11 at 00:27

2

the only thing i can think of is that it's a very simple LL(1) parser. you are parsing (in a very simple way) data from left to right and you need to lookahead one value to know what is happening. see http://en.wikipedia.org/wiki/LL_parser

answered Sep 07 '11 at 00:27

andrew cooke

45,717
10
93
143

You could also think of this as a modified LR parser as well, or any directional, non-backtracking parser. – templatetypedef Sep 07 '11 at 21:58

score 1 · Answer 2 · answered Sep 07 '11 at 06:22

I implement this pattern regularly (esp. in conjunction with sorting) when aggregating simple statistics in indexing. I've never heard of a formal name but at our company internally we simply refer to it as "batching" or "grouping", after the SQL GROUP BY clause.

In our system batches are usually delimited by an extracted attribute (rather than a bald edge-driven predicate) which we call the batch or group key. By contrast your examples seem to check for explicit delimiters.

score 1 · Answer 3 · edited May 23 '17 at 10:24

I believe that what you're describing is something called a streaming algorithm, an algorithm where the input is specified one element at a time until some stop condition is triggered. Streaming algorithms can be used to model algorithms where data is received over a network or from some device that generates data. Often, streaming algorithms assume that there is some fixed bound on the amount of memory that can be stored at any point in time, meaning that the algorithm needs to take special care to preserve important information while discarding useless data.

Many interesting algorithms in computer science fail to work in the streaming case, and there are a large class of algorithms specially designed to work for streams. For example, there are good streaming algorithms for finding the top k elements of a stream (see this question, for example), for randomly choosing k elements out of a stream, for finding elements that appear with high frequency in a stream, etc. One of the other answers to this question (from @andrew cooke) mentions that this resembles LL parsing. Indeed, LL parsing (and many other parsing algorithms, such as LR parsing) are streaming algorithms for doing parsing, but they're special cases of the more general streaming algorithm framework.

Hope this helps!

What's the name of this programming pattern?

3 Answers3