
I am writing a script that adds a "column" to a Python list of lists at 500 Hz. Here is the code that generates test data and appends it in a separate thread:

fileA:

import random, time, threading

data = [[] for _ in range(4)]  # list with 4 empty lists (4 rows)
column = [random.random() for _ in data]  # synthetic column of data

def synthesize_data():
    while True:
        for x,y in zip(data,column):
            x.append(y)
        time.sleep(0.002)  # equivalent to 500 Hz

t1 = threading.Thread(target=synthesize_data)
t1.start()
# example of data
# [[0.61523098235, 0.61523098235, 0.61523098235, ... ],
# [0.15090349809, 0.15090349809, 0.15090349809, ... ],
# [0.92149878571, 0.92149878571, 0.92149878571, ... ],
# [0.41340918409, 0.41340918409, 0.41340918409, ... ]]

fileB (in Jupyter Notebook):

[1] import fileA, copy

[2] # get a copy of the data at this instant.
    data = copy.deepcopy(fileA.data)
    for row in data:
        print len(row)

If you run cell [2] in fileB, you should see that the lengths of the "rows" in data are not equal. Here is example output when I run the script:

8784
8786
8787
8787

I thought I might be grabbing the data in the middle of the for loop, but that would mean the lengths should be off by 1 at most. Instead, the differences get more severe over time. My question: why is quickly adding columns to a list of lists unstable like this, and is it possible to make the process more stable?

You might suggest I use something like pandas, but I want to use Python lists because of their speed advantage (the code needs to be as fast as possible). I tested a for loop, the map() function, and a pandas DataFrame. Here is my test code (in a Jupyter Notebook):

import pandas as pd
import random

channels = ['C3','C4','C5','C2']
a = [[] for _ in channels]
b = [random.random() for _ in a]

def add_col((x,y)):
    x.append(y)

df = pd.DataFrame(index=channels)
b_pandas = pd.Series(b, index=df.index)

%timeit for x,y in zip(a,b): x.append(y)  # 1000000 loops, best of 3: 1.32 µs per loop
%timeit map(add_col, zip(a,b))  # 1000000 loops, best of 3: 1.96 µs per loop
%timeit df[0] = b  # 10000 loops, best of 3: 82.8 µs per loop
%timeit df[0] = b_pandas  # 10000 loops, best of 3: 58.4 µs per loop

You might also suggest that I append the samples to data as rows and then transpose when it's time to analyze. I would rather not do that either, in the interest of speed. This code will be used in a brain-computer interface, where analysis happens in a loop. Transposing would also have to happen inside that loop, and it would get slower as the data grows.
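For reference, here is roughly what that alternative would look like (a sketch with placeholder names such as samples and n_channels); note the transpose over all accumulated samples on every analysis pass:

import random

n_channels = 4
samples = []  # one entry per tick; each entry is a row with one value per channel

# acquisition loop: appending one row per tick is cheap
for _ in range(1000):
    samples.append([random.random() for _ in range(n_channels)])

# analysis loop: transposing rows into channels touches every sample each time
channels_view = [list(ch) for ch in zip(*samples)]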

1 Answer


The deepcopy() operation is copying lists as they are modified by another thread, and each copy operation takes a small amount of time (longer as the lists grow larger). So between copying the first of the 4 lists and copying the second, the other thread added 2 elements, indicating that copying a list of 8784 elements takes between 0.002 and 0.004 seconds.
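If you want to sanity-check that estimate, you can time a deep copy of a similarly sized structure yourself (the sizes below simply mirror the example output; the exact figure depends on your machine):

import copy, timeit

# nested list shaped like the example output: 4 rows of 8784 floats each
snapshot = [[0.5] * 8784 for _ in range(4)]

# average duration of one deepcopy; compare it to the 0.002 s append interval
print(timeit.timeit(lambda: copy.deepcopy(snapshot), number=100) / 100)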

That's because there is nothing preventing the interpreter from switching threads between executing synthesize_data() and the copy.deepcopy() call. In other words, your code is simply not thread-safe.

You'd have to coordinate between your two threads, for example by using a lock:

In fileA:

# ...
datalock = threading.RLock()
# ...

def synthesize_data():
    while True:
        with datalock:
            for x, y in zip(data, column):
                x.append(y)
        time.sleep(0.002)  # equivalent to 500 Hz; sleep outside the lock so the reader can acquire it

and in fileB:

with fileA.datalock:
    data = copy.deepcopy(fileA.data)
    for row in data:
        print len(row)

This ensures that copying only takes place when the thread in fileA is not trying to add more to the lists.

Using locking will slow down your operations; I suspect the pandas assignment operations are already subject to locks to keep them thread-safe.

  • Would +1 if I had enough rep... Thank you very much. Clear, concise, and helpful answer. For what it's worth, instead of using `copy.deepcopy()`, I modified my code to build the list from scratch with list comprehension `[[item for item in row] for row in fileA.data]`. `copy.deepcopy()` took 1.73 s on a nested list with 500000 columns, whereas rebuilding the list with list comprehension took 216 ms. Thanks for alerting me to the slowness of deepcopy! – jkr Sep 27 '16 at 18:34
  • @Jakub: you could use `[row[:] for row in fileA.data]` too (a sketch of this is below); you are creating a shallow copy of the nested lists. deepcopy has to 'copy' each value in the nested list individually, and it doesn't have domain knowledge of how deep is deep enough. Note that a list comp won't protect you from the threading issues, just make it a little less likely. – Martijn Pieters Sep 27 '16 at 18:54
  • @MartijnPieters: thanks for that modification. Your code is 15 times faster than mine! – jkr Sep 27 '16 at 19:44
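
A minimal sketch of the reader side, combining the lock from the answer with the per-row slice copy suggested in the comments (the snapshot() helper name is just illustrative):

import fileA

def snapshot():
    # hold the lock only long enough to take shallow per-row copies;
    # the elements are floats, so sharing them between lists is safe
    with fileA.datalock:
        return [row[:] for row in fileA.data]

data = snapshot()
for row in data:
    print(len(row))  # all rows should now have equal length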