
Assuming you are processing a live stream of data like this:

[image: code screenshot titled "async_loading_of_queued_data_in_python_process"]

What would be the best way to have a background thread update the `data` variable while the main logic runs `do_some_logic` inside the endless loop?

I have some experience with parallelization that has a clear start and end point, using multiprocessing/multithreading, but I am unsure how to continuously run a background thread that updates an internal variable. Any advice would be helpful - Thanks!

gies0r
  • What kind of granularity do you need? Do you need to do_some_logic periodically (async) or every time a row is added? – wwii Mar 29 '20 at 00:08
  • @wwii Every time a row (or a threshold of X rows) is added. Let's say, whenever 5 rows are added. – gies0r Mar 29 '20 at 00:09
  • 1
    The *live feed* and *Queue* exist? and you are trying to figure out how to update `data` via a thread??? – wwii Mar 29 '20 at 00:17
  • @wwii Exactly. I am pulling from a redis queue in block mode, which means that the background thread is waiting until new rows are coming in. Then the background worker appends to the DataFrame `data`. I would like to have the main process continuously working and the background thread doing the updating. – gies0r Mar 29 '20 at 00:19
  • Related: [python pandas dataframe thread safe?](https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe) – wwii Mar 29 '20 at 01:00

2 Answers


Write an update function and run it periodically in a background thread.

```python
import threading

def update_data(data):
    # pull new rows from the feed and append them to `data`
    pass

def my_inline_function(some_args):
    # do some stuff
    # note: args must be a tuple, even for a single argument
    t = threading.Thread(target=update_data, args=(some_args,))
    t.start()
    # continue doing stuff
```

Understand the constraints of the GIL so you know whether threading is really what you need.

I'd suggest looking into async/await to get a better idea of how this concurrency model actually works. It's similar to JavaScript: your main program is single-threaded, and I/O-bound tasks are the points where it context-switches between different parts of your application.

If this doesn't meet your requirements, look into multiprocessing - specifically, how to spin up a new process and how to share variables between processes.

nz_21
  • Well.. I am pretty aware of the GIL limitations and from what I understand there should not be a big problem with locks. The background process writes, the main process reads. It is maybe important to take care of the latest index within the main process - so that the state of `data` is consistent within one `do_some_logic()` run. So I really would like to separate the data pulling into one background thread (which is most of the time waiting) and the logic part, where most of the CPU power needs to go. – gies0r Mar 29 '20 at 00:27

Have the background thread make separate DataFrames with data retrieved from the live feed that can be sent to the main thread and appended to a DataFrame in the main thread. The DataFrames should have the same structure.

  • Subclass threading.Thread
    • give it two attributes:
      • a reference to the live feed queue and
      • a reference to a main thread queue
    • in a continuous loop its run method should accumulate rows from the live feed queue in a dictionary
    • when a predetermined number of rows have been accumulated:
      • make a DataFrame from the dictionary
      • put the DataFrame on the main thread queue
      • make a new empty dictionary to be subsequently filled
  • In the main thread
    • make an empty DataFrame with the required columns
    • make a queue
    • make an instance of the Thread passing it the two queues
    • In a loop
      • check the queue: if anything is there, append or concatenate it to the DataFrame
      • do stuff
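A minimal sketch of that outline (class name, column names, and batch size are made up for illustration; assumes pandas is installed):

```python
import queue
import threading

import pandas as pd

class FeedThread(threading.Thread):
    """Accumulates rows from the live feed and ships DataFrames to the main thread."""

    def __init__(self, live_feed, to_main, batch_size=5):
        super().__init__(daemon=True)
        self.live_feed = live_feed    # queue the live feed pushes rows onto
        self.to_main = to_main        # queue the main thread reads DataFrames from
        self.batch_size = batch_size

    def run(self):
        rows = {"a": [], "b": []}     # column names are illustrative
        while True:
            row = self.live_feed.get()            # blocks until a row arrives
            if row is None:                       # sentinel -> shut down
                break
            rows["a"].append(row[0])
            rows["b"].append(row[1])
            if len(rows["a"]) >= self.batch_size:
                self.to_main.put(pd.DataFrame(rows))  # hand a batch to the main thread
                rows = {"a": [], "b": []}             # start a fresh batch

# main thread
live_feed, to_main = queue.Queue(), queue.Queue()
data = pd.DataFrame(columns=["a", "b"])
FeedThread(live_feed, to_main, batch_size=2).start()

for i in range(4):                    # simulate four rows arriving on the feed
    live_feed.put((i, i * 10))
live_feed.put(None)

for _ in range(2):                    # two batches of two rows are expected
    batch = to_main.get(timeout=5)
    data = pd.concat([data, batch], ignore_index=True)
    # ... do_some_logic(data) would run here ...
```

Because only the main thread ever touches `data`, no locking is needed; the queue is the only shared object, and `queue.Queue` is thread-safe.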
wwii