Rotating Large Arrays, Fastest Possible

Question

I am relatively new to Python and looking for best optimized code for rotating large multi-dimensional arrays. In the following code I have a 16X600000 32bit floating point multi-dimensional array and according to the timer it takes about 30ms to rotate the contents on my quad core acer windows 8 tablet. I was considering using some Cython routines or something similar if it would be possible to reduce the time required to rotate the array.

Eventually the code will be used to store y-axis values for a high speed data plotting graph based around the VisPy package and the 32bit float array will be passed to an OpenGL routine. I would like to achieve less than 1ms if possible.

Any comments, recommendations or sample code would be much appreciated.

import sys, timeit
from threading import Thread
from PyQt4 import  QtGui
import numpy as np

m = 16              # Number of signals.
n = 600000          # Number of samples per signal.
y = 0.0 * np.random.randn(m, n).astype(np.float32) 
Running = False

class Window(QtGui.QWidget):
    def __init__(self):
        QtGui.QWidget.__init__(self)
        self.button = QtGui.QPushButton('Start', self)
        self.button.clicked.connect(self.handleButton)
        layout = QtGui.QVBoxLayout(self)
        layout.addWidget(self.button)

    def handleButton(self):
        global Running, thread, thrTest
        if Running == True:
            Running = False
            self.button.setText('Start')
            thrTest.isRunning = False
            print ('stop')
        else:
            Running = True
            self.button.setText('Stop')
            thrTest = testThread()
            thread = Thread(target=thrTest.run, daemon=True )
            thread.start()
            print ("Start")   

class testThread(Thread):
    def __init__(self):
        self.isRunning = True
    def run(self):
        print('Test: Thread Started')
        while self.isRunning == True:
            start_time = timeit.default_timer()
            y[:, :-1] = y[:, 1:]
            elapsed = timeit.default_timer() - start_time
            print ('Time (s)= ' + str(elapsed))
        print('Test: Closed Thread')


if __name__ == '__main__':
    app = QtGui.QApplication(sys.argv)
    window = Window()
    window.show()
    sys.exit(app.exec_())

Update

I guess there has been some confusion about exactly what I am trying to do so I will try to explain a little better.

The ultimate goal is to have a fast real-time data logging device which draws line on a graph representing the signal value. There will be multiple channels and a sampling rate of at least 1ms and as much recording time as possible. I have started with this VisPy example. The code in the example which writes the new data into the arrays and sends it to OpenGL is in the On_Timer function near the bottom. I have modified this code slightly to integrate the OpenGL canvas into a Qt gui and added some code to get data from an Arduino Mega through an ethernet socket.

Currently I can produce a real time graph of 16 lines with a sampling rate right about 1ms and a frame rate of around 30Hz with a recording time of about 14 seconds. If I try to increase the channel count or the recording length any more the program stops working as it cannot keep up with the flow of data coming in through the Ethernet port at 1ms.

The biggest culprit I can find for this is the time it takes to complete the data buffer shift using the y[:, :-1] = y[:, 1:] routine. Originally I submitted benchmark code where this function was being timed in the hope that someone knew of a way to do the same thing in a more efficient manner. The purpose of this line is to shift the entire array one index to the left, and then in my very next line of code I write new data to the first slot on the right.

Below you can see my modified graph update routine. First it takes the new data from the queue and unpacks into a temporary array, then it shifts the contents of the main buffer array, and finally it copies the new data into the last slot of the main array. Once the queue is empty it calls the update function so that OpenGL updates the display.

def on_timer(self, event):
    """Add some data at the end of each signal (real-time signals)."""
    k=1
    s = struct.Struct('>16H')
    AdrArray =  0.0 * np.random.randn(16,1).astype(np.float32)
    if not q.qsize() == 0:
        while q.qsize() > 0:
            print (q.qsize())
            print ('iin ' + str(datetime.datetime.now()))
            AdrArray[:,0]= s.unpack_from(q.get(), offset=4)
            y[:, :-1] = y[:, 1:]
            y[:, -1:] = .002*AdrArray 
            print ('out ' + str(datetime.datetime.now()))
        self.program['a_position'].set_data(y.ravel().astype(np.float32))
        self.update()

Is it really necessary to rotate the data? Maybe you could just view the data through a sliding window instead. — Janne Karila, Jan 07 '15 at 18:05
Hi Janne, you may be correct but i cannot visualize how to accomplish that. I have tested several python packages for graphing (pyqtgraph, galry, Vispy) and found Vispy to be the most promising and active solution. With that said my end goal of being able to display a line chart with as many channels as possible with as little as 1ms sampling rate per channel it seems to me the only real answer for best performance is to use one of these openGL based packages and build upon them. — JMD, Jan 07 '15 at 20:25
Currently I have a solution using parts of this code here which rotate an array that is 16X18,000 giving me basically 16 data channels with a sampling rate of 1.0ms which equates to 18 seconds of data and it works very well. — JMD, Jan 07 '15 at 20:27
I would however like to increase either the channel count and or the amount of data that can be recorded and so far the single limiting factor I have run into is the speed of the "y[:, :-1] = y[:, 1:]" function as it becomes too slow when the array is larger then 16X18,000 (or 800ns to rotate). Im still learning myself as I only started with python less than a month ago but I believe the data in this array is being loaded into an OpenGL VBO somehow inside the VisPy backend. — JMD, Jan 07 '15 at 20:27
I don't totally understand how you're planning on using the values in your array, but at first glance it seems like you really want to be doing this in OpenGL rather than numpy. One trick I've used before is to represent the 1D array as an OpenGL texture with the wrap parameter in one dimension set to repeat [(e.g. here)](https://open.gl/textures). You can then 'roll' through your values by translating in texture coordinate space. This should be *blisteringly* fast compared with 'rolling' the array in numpy. — ali_m, Jan 07 '15 at 22:36
Also I suggest that using the word 'rotate' in your question is a bit confusing - from the title it sounds like you want to do coordinate rotations. Personally I'd go with 'roll', 'shift' or 'wrap'. — ali_m, Jan 07 '15 at 23:01
I think maybe [this sort of approach](http://stackoverflow.com/a/10806453/1461210) is what you're looking for. — ali_m, Jan 08 '15 at 15:10
If your main goal is to draw on the screen, why are you worried about 600K pts/sample? This is a far beyond what anyone can perceive, and horizontal resolutions are typically in the thousands, not hundreds of thousands. Instead, downsample to something reasonable early on and work with that. — tom10, Jan 08 '15 at 16:57
@Tom10 At this point I am shooting for 1000 pts per second per channel. — JMD, Jan 08 '15 at 19:02
@ali_m I believe this is exactly what I am looking for, not sure I will be able to apply it within VisPy, looks like i would have to start my own openGL app — JMD, Jan 08 '15 at 19:03

hpaulj · Answer 1 · 2015-01-07T22:57:53.607

Do you really want this 'roll'? It's shifting the values left, filling with the last column

In [179]: y = np.arange(15).reshape(3,5)
In [180]: y[:,:-1]=y[:,1:]
In [181]: y
Out[181]: 
array([[ 1,  2,  3,  4,  4],
       [ 6,  7,  8,  9,  9],
       [11, 12, 13, 14, 14]])

For nonstandard 'roll' like this it is unlikely that there's anything faster.

np.roll has a different fill on the left

In [190]: np.roll(y,-1,1)
Out[190]: 
array([[ 1,  2,  3,  4,  0],
       [ 6,  7,  8,  9,  5],
       [11, 12, 13, 14, 10]])

For what it's worth, the core of roll is:

indexes = concatenate((arange(n - shift, n), arange(n - shift)))
res = a.take(indexes, axis)

Your particular 'roll' could be reproduced with a similar 'take'

In [204]: indexes=np.concatenate([np.arange(1,y.shape[1]),[y.shape[1]-1]])
In [205]: y.take(indexes,1)

Your y[:,:-1]... is faster than roll because it does not create a new array; instead it just overwrites part of the existing one.

take accepts an out parameter, so this is possible:

y.take(indexes,1,y)

though speed wise this only helps with small arrays. For large ones your overwriting assignment is faster.

I'd also suggest looking at using the transpose, and rolling on the axis=0. For an order='C' array, the values of a row form a contiguous block.

The big time consumer is that you have to copy (nearly) all of the array from one location to another, either in a new array, or onto itself. If the data were in some sort of ring buffer, you could just change a pointer, without having to copy any data.

after the shift left the value at the far right end is overwritten with a brand new value that comes off the threaded Queue so that type of roll is fine! I have thought about some sort of ring buffer or queue type but then again the function the array gets sent to is expecting a 32bit float type array — JMD, Jan 08 '15 at 12:33

Almar · Answer 2 · 2015-01-12T12:30:54.340

1

As some others have mentioned, I don't think you want to shift the array on the CPU. Moving all the elements in the array on each update will always be slow. I don't expect Cython to help here either, since Numpy will be quite optimal already.

Instead, you want to take care of the shifting in the vertex shader. ~~An example of what I think is similar to what you want is here: http://vispy.org/examples/demo/gloo/realtime_signals.html~~

Edit: One approach is to consider your VBO a circular buffer. You add new values at the "oldest" location, and in the vertex shader you use an offset uniform to map the signals to the correct position. A similar technique is used in http://vispy.org/examples/demo/gloo/spacy.html

edited Jan 12 '15 at 12:30

answered Jan 08 '15 at 09:22

Almar

4,756
2
13
9

Actually that is exactly where I got the basis of my code from. If you look closely at the "def On_Timer" event where they actually update the signals it is using exactly this shift function then takes the array and loads it into the VBO with the program.setdata call! – JMD Jan 08 '15 at 12:16
And again the Numpy Roll Function is actually 2.4x slower then this function i am showing here. I've also seen code on google benchmarked native functions, Numpy functions and custom C functions and the C functions were almost always an order of magnitude quicker, just havnt found any C code written especially for shifting float arrays yet and unsure of how to integrate a C function into my Python project as i have not done that yet! – JMD Jan 08 '15 at 12:19
@JMD So on every frame you're making a shifted copy of the numpy array then copying the whole thing back up to the VBO? That sounds like a really inefficient use of bandwidth! Why not just update the way you are indexing the elements in the VBO on each frame, rather than copying the whole array again and again? – ali_m Jan 08 '15 at 14:40
from my understanding the Y array is shifted left one index and new data is loaded into the far right end of the array. currently my new data is coming from reading the analog inputs of an Arduino Mega then sending the values in an array over Ethernet to my python program running on my PC, this array from the arduino is then copied onto the end of the large Y array after it has been shifted. I am using slightly modified code from the VisPy Examples, if I had more openGL experience I suppose I could comment on what your saying – JMD Jan 08 '15 at 14:55
@Ali_m I did read the link you gave about OpenGL Textures but I can't visualize how I could hold this large array inside the gpu and tell the gpu to shift the data over one index then tell the GPU to copy 1 new column of data onto the end of its existing array? – JMD Jan 08 '15 at 14:58
1

@JMD Based on the wording of your question I had no idea you were inserting new values into the array as well as shifting the existing ones. I think this is a bit of an [XY problem](http://mywiki.wooledge.org/XyProblem) - I'm sure there's a better approach, but you're unlikely to get a good answer here unless you explain in a bit more detail what exactly you're trying to do. At least give the link to the vispy demo code in your question. – ali_m Jan 08 '15 at 15:07
@JMD My mistake; I assumed that the example would be more efficient. I edited my comment to add one possible approach. – Almar Jan 12 '15 at 12:31

score 1 · Answer 3 · answered Jan 19 '15 at 17:04

The realtime-signals.py example in glumpy implements a ring buffer and might help you:

https://github.com/glumpy/glumpy/blob/master/examples/realtime-signals.py

It is easily adaptable (and will be soon adapted) for vispy as well. The trick is to use a Fortran-like layout such that updating the 16 signals result in uploading a contiguous block of 16 floats to GPU. The example should handle yours 16 x 600,000 data if it fits into GPU memory (only tested 16 x 100,000 at 250 fps).

Note that because you have a limited horizontal screen resolution, this might prevent you from seeing the whole data (in the end only 1600 data will be displayed for each signal if your screen is 1600 pixels wide).

Rotating Large Arrays, Fastest Possible

Update

3 Answers3