I have a table that runs a fairly heavy computation (about 5 minutes per key). I want to reserve jobs and run it on multiple machines. I noticed that the other machines get locked out of the table as soon as one machine starts processing a job: they effectively have to wait until that job has finished before they can start their own, or even get a chance to grab one. Where does this behavior stem from? When a job takes too long, I run into "Lock wait timeout exceeded" errors on machines other than the one currently processing it.

@schema
class HeavyComputation(dj.Computed):
    definition = """
    # ...
    -> Table1
    class_label      :    varchar(25)
    -> Table2.proj(somekey2="somekey")
    ---
    analyzed  :    longblob
    """

I am running .populate() on the table with

settings = {"display_progress": True,
            "reserve_jobs": True,
            "suppress_errors": True,
            "order": "random"}
HeavyComputation.populate(**settings)
Horst

2 Answers

Yes, this is a tricky problem with how transaction serialization works. I will explain in more detail and provide additional background, but the solution is to reorder the primary key attributes in the table:

@schema
class HeavyComputation(dj.Computed):
    definition = """
    # ...
    -> Table1
    -> Table2.proj(somekey2="somekey")
    class_label      :    varchar(25)
    ---
    analyzed  :    longblob
    """

Again, I will provide a detailed explanation later since it will take some time to write up. I did not want to make you wait.

    This actually did not solve it for me, Dimitri. I changed the order of the keys as suggested and also made Table2.proj() a non-primary-key attribute, but the populate calls are still locking each other out. Any ideas? – Horst Feb 08 '23 at 22:32

The problem turned out to be a .delete() call inside a subfunction of my make function. I am keeping track of temporary files in another (unrelated) table and wanted things to be cleaned up once the make routine finishes. However, this .delete() was running into a table lock and thereby prevented the .populate() call from finishing.
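
For anyone hitting the same thing, here is a minimal sketch of the pattern I mean, assuming the cleanup can wait until the computation is done: make only records what needs cleaning, and the actual deletion runs afterwards, outside of make. Everything in it is a hypothetical stand-in and not DataJoint API: pending_cleanup replaces my tracking table, and make and cleanup_temp_files are plain functions so the example runs without a database.

```python
import os
import tempfile

# Hypothetical stand-in for the temp-file tracking table: a plain list.
pending_cleanup = []


def make(key):
    """Stand-in for the body of a make(): compute and record temp files, delete nothing."""
    fd, path = tempfile.mkstemp(suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(f"intermediate data for {key}")
    pending_cleanup.append(path)  # only record the file; no delete in here
    return len(key)  # placeholder for the real result


def cleanup_temp_files():
    """Meant to run after all make() calls are done, not from inside one."""
    removed = 0
    while pending_cleanup:
        path = pending_cleanup.pop()
        if os.path.exists(path):
            os.remove(path)
            removed += 1
    return removed


results = [make(key) for key in ("a", "bb", "ccc")]
created = list(pending_cleanup)  # remember the paths for checking afterwards
removed = cleanup_temp_files()
```

In the real pipeline the list would be the tracking table and the cleanup would run once .populate() has returned. Whether that is acceptable depends on how long the temporary files may stick around.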

Horst