Fast way to read interleaved data?

Question

I've got a file containing several channels of data. The file is sampled at a base rate, and each channel is sampled at that base rate divided by some number -- it seems to always be a power of 2, though I don't think that's important.

So, if I have channels a, b, and c, sampled at divders of 1, 2, and 4, my stream will look like:

a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 a5 ...

For added fun, the channels can independently be floats or ints (though I know for each one), and the data stream does not necessarily end on a power of 2: the example stream would be valid without further extension. The values are sometimes big and sometimes little-endian, though I know what I'm dealing with up-front.

I've got code that properly unpacks these and fills numpy arrays with the correct values, but it's slow: it looks something like (hope I'm not glossing over too much; just giving an idea of the algorithm):

for sample_num in range(total_samples):
    channels_to_sample = [ch for ch in all_channels if ch.samples_for(sample_num)]
    format_str = ... # build format string from channels_to_sample
    data = struct.unpack( my_file.read( ... ) ) # read and unpack the data
    # iterate over data tuple and put values in channels_to_sample
    for val, ch in zip(data, channels_to_sample):
        ch.data[sample_num / ch.divider] = val

And it's slow -- a few seconds to read a 20MB file on my laptop. Profiler tells me I'm spending a bunch of time in Channel#samples_for() -- which makes sense; there's a bit of conditional logic there.

My brain feels like there's a way to do this in one fell swoop instead of nesting loops -- maybe using indexing tricks to read the bytes I want into each array? The idea of building one massive, insane format string also seems like a questionable road to go down.

Update

Thanks to those who responded. For what it's worth, the numpy indexing trick reduced the time required to read my test data from about 10 second to about 0.2 seconds, for a speedup of 50x.

If ch.samples_for is the problem, then we would need to see that function to know how to help you out there. How many channels do you have by the way? — Justin Peel, Nov 19 '10 at 18:24
Are you only actually reading one channel each time through the loop right now? At first I thought that wasn't the case, but now it isn't clear to me. — Justin Peel, Nov 19 '10 at 19:33
Ah, the files are from an embedded system that reads physiological data (heart rate, respiration, etc) along with some digital signals -- we're doing functional MRI and psychological testing while recording physio signals. This is just what we get. — Nate, Nov 19 '10 at 20:27
Right now, I'm reading, in groups, `(a0 b0 c0) (a1) (a2 b1) (a3) (a4 b2 c1) (a5)` — Nate, Nov 19 '10 at 20:29
It's pretty elegantly solved now, but we've got somewhere between 1 and 24 channels. — Nate, Nov 19 '10 at 21:51

Sven Marnach · Accepted Answer · 2010-12-03T20:40:43.357

7

The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.

First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then

N = lcm*sum(b_1/d_1+...+b_k/d_k)

will be the number of bytes which the pattern of streams will repeat after. If you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.

You can now build the array of stream indices for the first N bytes by something similar to

stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)

Here, d is the sequence d_1, ..., d_k and b is the sequence b_1, ..., b_k.

Now you can do

data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:,stream_index == i].ravel() for i in range(k)]

You possibly need to pad the data a bit at the end to make the reshape() work.

Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be intepreted as big endian integers, simply write

streams[0].dtype = ">i"

This won't change the data in the array in any way, just the way it is interpreted.

This may look a bit cryptic, but should be much better performance-wise.

edited Dec 03 '10 at 20:40

answered Nov 19 '10 at 19:56

Sven Marnach

574,206
118
941
841

My stream may contain mixed data types (int16 and float64) -- and data is probably an array of bytes. Will this approach still be possible? – Nate Nov 19 '10 at 20:11
Oh, I just noticed I did not read your code carefully enough. I will update my answer to reflect this... and yes, it is still possible... – Sven Marnach Nov 19 '10 at 20:17
Sweet. This looks like it's what I want -- I'm gonna need to do some thinking, though, before it totally makes sense. One thing I'm confused by a bit: "which stream the first N bytes belong to" -- the first N bytes belong to any one stream; a run of N bytes is guaranteed to belong to every stream. – Nate Nov 19 '10 at 20:49
Oh, this seems to be too confusing. English is not my first language, but I wil try to improve the explanation... – Sven Marnach Nov 19 '10 at 20:56
No worries -- really, I should just try coding it up and see where that takes me. Basically, I'm gonna read the whole thing, split it up into blocks N bytes long, and then do some indexing into it, right? – Nate Nov 19 '10 at 21:07
Yeah, that's the idea. It sounds a lot easier the way you put it :) – Sven Marnach Nov 19 '10 at 21:11
Ah! In my sample dataset, stream_index would look like `[0, 1, 2, 0, 0, 1, 0, 0, 1, 2]`. Brilliant! I will need to pad to the next multiple of N... but this should work perfectly. Thanks! – Nate Nov 19 '10 at 21:20
Also: why the call to .nonzero()? My limited tests with generated data have this working without -- just want to make sure there's not an edge case I'm missing here. – Nate Nov 19 '10 at 21:23
You are right, you can leave this out. In some situations, NumPy annoyingly converts bools to ints when indexing, but in this situation you are fine without `nonzero()`. I will edit the answer again :) – Sven Marnach Nov 19 '10 at 21:28
I cannot say how `stream_index` would look like for your sample dataset because I don't know the number of bytes each sample has. But before applying `repeat()`, it should look like you said. – Sven Marnach Nov 19 '10 at 21:31
Right -- in my comment, I was pretending they were each one byte. Really, if *a* and *c* are int16 and *b* is a float32, it'd be more like `[0 0 1 1 1 1 2 2 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 2 2]` – Nate Nov 19 '10 at 21:47
One last note: the length of stream_index is actually equal to N, so the first calculation (which is probably the trickiest thing to understand in the algorithm) isn't needed. – Nate Nov 22 '10 at 19:58

Tobu · Answer 2 · 2010-11-19T18:31:33.280

2

Replace channel.samples_for(sample_num) with a iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:

for (chan, sample_data) in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)

To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.

for i in itertools.count():
    for chan in channels:
        if i % chan.period == 0:
            yield chan

edited Nov 19 '10 at 18:31

answered Nov 19 '10 at 18:26

Tobu

24,771
4
91
98

So, in this solution, I wouldn't be counting to total_samples (which is the total number of samples at the base rate) but to the total number of samples I expect in the file -- the sum of the lengths of all the channels. Right? – Nate Nov 19 '10 at 18:53
The clock can count to infinity, which is what itertools.count() does. izip will discard it when the file runs out. – Tobu Nov 19 '10 at 19:53
I can't rely on the file ending -- there's often crap after the end of the data I'm interested in. But I could compute the total number of "things I want to read" and count to that. – Nate Nov 19 '10 at 20:50

score 1 · Answer 3 · answered Nov 19 '10 at 18:19

1

The grouper() recipe along with itertools.izip() should be of some help here.

answered Nov 19 '10 at 18:19

Ignacio Vazquez-Abrams

776,304
153
1,341
1,358

Fast way to read interleaved data?

Update

3 Answers3

Linked