5

I use the following code to load 24-bit binary data into a 16-bit numpy array:

import numpy

# pad each 3-byte little-endian sample to 4 bytes (the high 3 bytes of an int32)
temp = numpy.zeros((len(data) // 3, 4), dtype='b')
temp[:, 1:] = numpy.frombuffer(data, dtype='b').reshape(-1, 3)
# >> 16 because I need to divide by 2**16 to fit my data into a 16-bit array,
# which is what my (audio) application needs
temp2 = temp.view('<i4').flatten() >> 16
output = temp2.astype('int16')

I imagine it's possible to improve the speed, but how?

Basj
  • Note: `data` is read from a binary file with `open`, the `chunk` module and `chunk.read`. – Basj Mar 02 '14 at 15:05
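For context, here is a rough sketch of how `data` might be obtained with the stdlib `chunk` module mentioned in the comment above; the file name and the assumption that the input is a RIFF/WAVE file are mine, not from the question:

import chunk

with open('input.wav', 'rb') as f:                    # hypothetical file name
    riff = chunk.Chunk(f, bigendian=False)            # outer RIFF chunk
    assert riff.getname() == b'RIFF' and riff.read(4) == b'WAVE'
    while True:
        sub = chunk.Chunk(riff, bigendian=False)      # next sub-chunk ('fmt ', 'data', ...)
        if sub.getname() == b'data':
            data = sub.read()                         # packed 24-bit samples
            break
        sub.skip()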

2 Answers

4

It seems like you are being very roundabout here. Won't this do the same thing?

output = np.frombuffer(data,'b').reshape(-1,3)[:,1:].flatten().view('i2')

This saves some time by not zero-filling a temporary array, skipping the bit shift and avoiding some unnecessary data moves. I haven't actually benchmarked it yet, though, and I expect the savings to be modest.

Edit: I have now performed the benchmark. For len(data) of 12 million, I get 80 ms for your version and 39 ms for mine, so pretty much exactly a factor 2 speedup. Not a very big improvement, as expected, but then your starting point was already pretty fast.
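The answer doesn't show the exact benchmark lines; a rough sketch along these lines (synthetic data, stdlib `timeit` rather than IPython's `%timeit`) reproduces the comparison and doubles as an equivalence check:

import numpy as np
import timeit

data = b'\x01\x02\x03' * 4000000                # 12 million bytes of fake samples

def original(data):
    temp = np.zeros((len(data) // 3, 4), dtype='b')
    temp[:, 1:] = np.frombuffer(data, dtype='b').reshape(-1, 3)
    return (temp.view('<i4').flatten() >> 16).astype('int16')

def oneliner(data):
    return np.frombuffer(data, 'b').reshape(-1, 3)[:, 1:].flatten().view('i2')

assert np.array_equal(original(data), oneliner(data))       # same values (little endian)
print(timeit.timeit(lambda: original(data), number=10) / 10)
print(timeit.timeit(lambda: oneliner(data), number=10) / 10)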

Edit2: I should mention that I have assumed little endian here. However, the original question's code is also implicitly assuming little endian, so this is not a new assumption on my part.

(For big-endian data on a big-endian architecture, you would replace `1:` with `:-1`. If the data had a different endianness than the CPU, you would also need to reverse the order of the bytes with `::-1`.)
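To make the variants concrete, here is a small sketch of my own (not from the answer) that spells out both data layouts with explicit byte-order codes, which also removes the dependence on the machine's endianness:

import numpy as np

data = b'\x01\x02\x03' * 4                      # a few fake packed 24-bit samples
samples = np.frombuffer(data, 'b').reshape(-1, 3)
# little-endian 24-bit data ([LSB, mid, MSB]): keep the top two bytes
le16 = samples[:, 1:].flatten().view('<i2')
# big-endian 24-bit data ([MSB, mid, LSB]): keep the first two bytes
be16 = samples[:, :-1].flatten().view('>i2')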

Edit3: For even more speed, I think you will have to go outside Python. This Fortran function, which also uses OpenMP, gets me a factor 2+ speedup compared to my version (so 4+ times faster than yours).

subroutine f(a,b)
        implicit none
        integer*1, intent(in)  :: a(:)
        integer*1, intent(out) :: b(size(a)*2/3)
        integer :: i
        ! copy the two most significant bytes of each little-endian 24-bit sample
        !$omp parallel do
        do i = 1, size(a)/3
                b(2*(i-1)+1) = a(3*(i-1)+2)
                b(2*(i-1)+2) = a(3*(i-1)+3)
        end do
        !$omp end parallel do
end subroutine

Compile with `FOPT="-fopenmp" f2py -c -m basj{,.f90} -lgomp`. You can then import and use it in Python:

import numpy as np
import basj
def convert(data): return basj.f(np.frombuffer(data, 'b')).view('i2')

You can control the number of cores to use via the environment variable `OMP_NUM_THREADS`, but it defaults to using all available cores.
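For example, to restrict it to four threads (the script name is just a placeholder):

OMP_NUM_THREADS=4 python convert.py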

amaurea
  • Thank you. I am currently trying to benchmark it; I'll give you the results. (Remark: I need the result as a contiguous array, so using a view is not possible.) – Basj Mar 02 '14 at 15:38
  • The result is contiguous in my test, according to `.flags`. If you want to be completely sure that it's always contiguous (though I can't see how it could vary), you could wrap it in `np.ascontiguousarray`, which is a free operation if the array is already contiguous, but creates a contiguous copy otherwise. – amaurea Mar 02 '14 at 15:40
  • 2
    THe array is made contiguous by [`flatten`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html), which always makes a copy. Here `view` simply reinterprets every 2 single byte numbers as one 2 byte number. – Jaime Mar 02 '14 at 15:42
  • OK for contiguous. (By the way, is there a `numpy.iscontiguous` which returns `True` or `False`?) About the benchmark, I get the opposite result (that's very strange!): http://pastebin.com/ju0GjLE2 Do you have an idea why, @amaurea? – Basj Mar 02 '14 at 15:48
  • @Basj what times do you get for which data length for each method? I used `timeit` in `ipython` to perform the benchmark, which I think should be pretty reliable (it repeats the test several times to get a reliable measurement). – amaurea Mar 02 '14 at 15:51
  • The only concern with your solution is endianness. I think your code will only work with big-endian data, as you are taking the last two bytes to be the least significant ones that contain the 16-bit number. With little-endian data, which is what most PCs use, the slicing would have to be `[:, :-1]`. Or maybe it's the other way around, but there is an endianness assumption which needs to be specified. – Jaime Mar 02 '14 at 15:53
  • For checking if something is contiguous: I only know of `.flags['C_CONTIGUOUS']`. – amaurea Mar 02 '14 at 15:53
  • @Jaime Yes, I assume little endian here. But so does the original code. On a big endian system, his function would throw away everything but the most significant byte. I'll edit my answer to make this clear, though. – amaurea Mar 02 '14 at 15:56
  • @amaurea : I get these numbers : `%timeit binary24_to_int16(data) # 1000 loops, best of 3: 565 µs per loop` / `%timeit binary24_to_int16_2(data) # 1000 loops, best of 3: 651 µs per loop`, see http://pastebin.com/ju0GjLE2 – Basj Mar 02 '14 at 15:58
  • Thanks for the edits. Can you paste the lines of code you used for benchmarking? (I would like to solve the problem I have: my benchmark gives the opposite conclusion!) – Basj Mar 02 '14 at 16:05
  • @Basj With your very short array, I get 492 µs for yours and 411 µs for mine. Our numbers are close, but don't quite agree. What do you get for a long array? – amaurea Mar 02 '14 at 16:06
  • With my short array `data = 'ABCUYDSFISUDDSFDSF765647' * 5000` I get 564µs with my method and 640µs for yours.... With your long array, I get 71ms for mine, and 65ms for yours... – Basj Mar 02 '14 at 16:18
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/48804/discussion-between-amaurea-and-basj) – amaurea Mar 02 '14 at 16:19
1

Inspired by @amaurea's answer, here is a Cython version (I already used Cython in my original code, so I'll stick with Cython rather than mixing Cython and Fortran):

import cython
import numpy as np
cimport numpy as np

def binary24_to_int16(bytes data):
    # take `bytes` (not `char *`) so that len() gives the real buffer length
    # even when the audio data contains null bytes
    cdef int i
    cdef int n = len(data) // 3
    cdef char *a = data
    res = np.zeros(n, np.int16)
    cdef char *b = <char *>((<np.ndarray>res).data)
    for i in range(n):
        # keep the two most significant bytes of each little-endian 24-bit sample
        b[2*i] = a[3*i+1]
        b[2*i+1] = a[3*i+2]
    return res

There is a factor 4 speed gain :)
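For reference, since the build step isn't shown above: a minimal, hypothetical setup.py for the extension, assuming the Cython code is saved as binary24.pyx (file and module names are mine):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

# numpy's headers are needed because of the `cimport numpy` above
ext = Extension("binary24", ["binary24.pyx"], include_dirs=[numpy.get_include()])
setup(ext_modules=cythonize([ext]))

Build with `python setup.py build_ext --inplace`, then `import binary24` and call `binary24.binary24_to_int16(data)`.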

Basj