2

I am writing a task scheduling module in Java Spring to handle task items that are stored in a MySQL database.

Schema structure of the Task table:

ID | TASK_UUID | TASK_CONTENT(VARCHAR) | CREATED_TS | UPDATED_TS | STATUS(NEW/PROCESSING/COMPLETE)

I would like to implement multiple task scheduler workers that fetch tasks for execution from the Task table. How can I ensure that two schedulers never get the same task for execution at the same time? Is there a good Java framework I can make use of?

#Edit 1: The task execution module is designed to run on different machines, so synchronized methods may not work.

#Edit 2: Each machine will get a random or irregular number of tasks. So if an auto-increment sequence is used, the allocation size of the index would have to be irregular too, otherwise some tasks would never be handled.

#Edit 3: Each machine runs a Quartz Scheduler configured with a recurring job that fetches and executes tasks. The interval between job runs is about 10 seconds. So my goal is to ensure each machine's scheduler can fetch at least 10 tasks in every Quartz job run.

Kaguya
  • 2
    Use the atomicity of transactions in the database. For instance, adding a "task scheduler id" column and setting it only if it is null would do the trick, provided it is done in a transaction. – Erk Mar 16 '21 at 04:01
  • So @Erk is suggesting adding another column being effectively an ID identifying the machine the task is being processed on... – Mr R Mar 16 '21 at 04:07
  • What about tasks that stall / fail / machine processing it dies / etc.? How do you deal with them? (I'm asking questions to help you to an answer / help us all to an answer). – Mr R Mar 16 '21 at 04:08
  • @MrR, yes a timestamp column and timeout handling would probably also be needed. – Erk Mar 16 '21 at 04:13
  • @Erk Correct me if I misunderstand your idea: add a new column MachineId to the Task table, then create a singleton master machine that keeps assigning a MachineId to tasks without one. The other machines just query their own tasks by the corresponding MachineId. – Kaguya Mar 16 '21 at 04:36
  • 1
    @Kaguya unless you need to have a master, the DB is your sequencer, so any machine can request the next task, only 1 will get it. – Mr R Mar 16 '21 at 04:51
  • @MrR I got your point. Your approach seems to work if the machines fetch tasks one by one, so the next task sequence can be updated continuously by the database itself. But what if each machine is going to get a random or irregular number of tasks? – Kaguya Mar 16 '21 at 05:34
  • @Kaguya - what if there are only 3 tasks scheduled (i.e. in the DB), but your executor wants to run 4 tasks (take 4)? It doesn't have to run them sequentially; it's just about getting something assigned. And you could request multiple tasks at once; it just makes for more complicated SQL (a bigger LIMIT, and the SQL needs to be correct). The alternative of doing everything in Java land (other than saving the status) results in a really complicated multi-machine communication mechanism and seems like it could be very hard to debug. – Mr R Mar 16 '21 at 05:44
  • One of the answers to this question on [SELECT FOR UPDATE](https://dba.stackexchange.com/questions/7363/concurrently-update-with-limit-mysql) suggests that LIMIT > 1 runs a risk of deadlock. YMMV. The design question you are trading off is throughput of assignment vs. time to communicate with the DB and/or transaction start/end. If the work your tasks need to do in Java land is big compared with the time to allocate, it doesn't really matter whether you get 1 or 3. – Mr R Mar 16 '21 at 05:52
  • @MrR Agreed. That could lead to deadlocks. My concern is that the task volume in production is large, say more than 10k tasks stacking up in the queue, and I would like this module to work through all of them. Ideally each task execution machine fetches more than one task at a time to minimize database queries. And because there is a time interval between job runs on each machine, it would be slow if only one task were fetched each time. – Kaguya Mar 16 '21 at 07:48
  • @Kaguya, I would start with a simple proof of concept where each machine picks one task at a time and see if there even is a need for any optimization. – Erk Mar 16 '21 at 15:37
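
A minimal sketch of the single-statement claim discussed in the comments above, assuming a hypothetical nullable `WORKER_ID` column is added to `Task` and that plain JDBC is used. The class name, method names and the extra column are illustrative and not part of the original schema. MySQL accepts `ORDER BY` and `LIMIT` on a single-table `UPDATE`, so one statement can mark up to `batchSize` rows for a worker; rows another worker has already claimed no longer match the `WHERE` clause.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Claims a batch of NEW tasks for one worker with a single UPDATE statement. */
class TaskClaimer {
    // WORKER_ID is a hypothetical extra column, NULL while the task is unclaimed.
    static final String CLAIM_SQL =
        "UPDATE Task SET STATUS = 'PROCESSING', WORKER_ID = ?, UPDATED_TS = NOW() "
      + "WHERE STATUS = 'NEW' AND WORKER_ID IS NULL "
      + "ORDER BY CREATED_TS LIMIT ?";

    static final String FETCH_SQL =
        "SELECT TASK_UUID FROM Task WHERE STATUS = 'PROCESSING' AND WORKER_ID = ?";

    /**
     * Marks up to batchSize NEW tasks as owned by workerId, then returns the
     * UUIDs of every PROCESSING task this worker owns (including tasks claimed
     * in an earlier run that are still marked PROCESSING).
     */
    static List<String> claim(Connection conn, String workerId, int batchSize)
            throws SQLException {
        try (PreparedStatement claim = conn.prepareStatement(CLAIM_SQL)) {
            claim.setString(1, workerId);
            claim.setInt(2, batchSize);
            claim.executeUpdate();
        }
        List<String> uuids = new ArrayList<>();
        try (PreparedStatement fetch = conn.prepareStatement(FETCH_SQL)) {
            fetch.setString(1, workerId);
            try (ResultSet rs = fetch.executeQuery()) {
                while (rs.next()) {
                    uuids.add(rs.getString("TASK_UUID"));
                }
            }
        }
        return uuids;
    }
}
```

Each Quartz job run could call something like `TaskClaimer.claim(conn, "machine-a", 10)`. The sketch does not cover the stalled-task case raised above; that would need a separate sweep that resets rows whose `UPDATED_TS` is older than some timeout.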

2 Answers

1

You could create the method getTask as a synchronized method:

Eg:

synchronized Task getTask() {
  // get NEW task from DB
  // update status to PROCESSING
  // return task
}

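#Edit 1: If so, use a SELECT ... FOR UPDATE query inside a transaction to block other queries from accessing the same task. The row lock is held only until the transaction ends, so both statements must run in the same transaction. Eg:

START TRANSACTION;
SELECT * FROM Task t WHERE t.status = 'NEW' ORDER BY t.created_ts LIMIT 1 FOR UPDATE;
UPDATE Task SET status = 'PROCESSING' WHERE id = <the task id>;
COMMIT;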

You could create a procedure to wrap the queries.
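
As an alternative to a stored procedure, the same two steps can be wrapped in one JDBC transaction. The following is a rough sketch under assumptions: the class and method names are made up for illustration, and the optional `SKIP LOCKED` clause is only available in MySQL 8.0 and later (with it, a worker skips rows locked by another worker instead of waiting for them).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Locks a batch of NEW tasks with SELECT ... FOR UPDATE and marks them PROCESSING. */
class LockingTaskFetcher {
    static String selectSql(int batchSize, boolean skipLocked) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        return "SELECT ID FROM Task WHERE STATUS = 'NEW' ORDER BY CREATED_TS LIMIT "
             + batchSize + " FOR UPDATE" + (skipLocked ? " SKIP LOCKED" : "");
    }

    static String updateSql(int idCount) {
        // One "?" placeholder per locked row id.
        return "UPDATE Task SET STATUS = 'PROCESSING' WHERE ID IN ("
             + String.join(", ", Collections.nCopies(idCount, "?")) + ")";
    }

    /** Returns the ids of the tasks this call moved from NEW to PROCESSING. */
    static List<Long> fetchAndMark(Connection conn, int batchSize, boolean skipLocked)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // row locks live until commit or rollback
        try {
            List<Long> ids = new ArrayList<>();
            try (PreparedStatement select =
                     conn.prepareStatement(selectSql(batchSize, skipLocked));
                 ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("ID"));
                }
            }
            if (!ids.isEmpty()) {
                try (PreparedStatement update =
                         conn.prepareStatement(updateSql(ids.size()))) {
                    for (int i = 0; i < ids.size(); i++) {
                        update.setLong(i + 1, ids.get(i));
                    }
                    update.executeUpdate();
                }
            }
            conn.commit();
            return ids;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The transaction is kept short on purpose: it only flips the status, and the actual task execution happens after the commit, so locks are not held while tasks run.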

linhx
  • MySQL is quite a heavy way to do this, so I'm wondering: is it all running in one process, or can multiple machines schedule and/or execute tasks? If the latter, synchronized wouldn't work. – Mr R Mar 16 '21 at 03:39
  • @MrR You are right. The module will be run by multiple machines or clusters. – Kaguya Mar 16 '21 at 03:49
  • So @Kaguya says multiple machines / clusters - so the database is the way to ensure multiple users across multiple machines can't get the same task. Use a query to UPDATE the next task to run - return the details, and set to running. – Mr R Mar 16 '21 at 03:52
  • @MrR how about if two machines are going to update the same Task to running? Will there be race conditions in this case? – Kaguya Mar 16 '21 at 03:59
  • 1
    If so, just use a SELECT ... FOR UPDATE query to block other queries from accessing the same task. Eg: SELECT * FROM Task t WHERE t.status = 'NEW' ORDER BY t.created_ts LIMIT 1 FOR UPDATE; UPDATE Task SET status = 'PROCESSING' WHERE id = <the task id>. You could create a procedure to wrap the queries. – linhx Mar 16 '21 at 04:02
  • https://www.mysqltutorial.org/mysql-transaction.aspx/ – Erk Mar 16 '21 at 04:14
  • @linhx So you are suggesting using pessimistic locking and limiting each query result to 1 so that only one row is locked? Will it work if each scheduler/worker queries multiple tasks to execute? – Kaguya Mar 16 '21 at 04:16
  • @Kaguya - [SELECT FOR UPDATE](https://dev.mysql.com/doc/refman/5.7/en/select.html) has this behaviour: _If you use FOR UPDATE with a storage engine that uses page or row locks, rows examined by the query are write-locked until the end of the current transaction._ So the first one in gets the next element. – Mr R Mar 16 '21 at 04:34
0

You can work around the atomicity or transaction issue like this:

Use the id of your task, assuming it is incremental. If you have three machines running the task scheduler, take the id modulo three and assign the tasks with remainder 0, 1 and 2 to one fixed machine each. The machines then never compete for the same task, so there is no race condition.
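
A small sketch of that modulo rule, assuming each machine is configured with a zero-based `workerIndex` and the total `workerCount` (both names, and the class itself, are illustrative). The filter can run in Java, or be pushed into the query with MySQL's `MOD()` function as in `SELECT_SQL`.

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns each task id to exactly one worker by id modulo worker count. */
class ModuloPartitioner {
    // Query-side variant: parameters are workerCount, workerIndex, batch size.
    static final String SELECT_SQL =
        "SELECT ID, TASK_CONTENT FROM Task "
      + "WHERE STATUS = 'NEW' AND MOD(ID, ?) = ? ORDER BY CREATED_TS LIMIT ?";

    static boolean ownedBy(long taskId, int workerIndex, int workerCount) {
        return Math.floorMod(taskId, (long) workerCount) == workerIndex;
    }

    /** Keeps only the ids that belong to the given worker, preserving order. */
    static List<Long> tasksFor(List<Long> ids, int workerIndex, int workerCount) {
        List<Long> mine = new ArrayList<>();
        for (long id : ids) {
            if (ownedBy(id, workerIndex, workerCount)) {
                mine.add(id);
            }
        }
        return mine;
    }
}
```

With ids 1 to 7 and three workers, worker 0 gets 3 and 6, worker 1 gets 1, 4 and 7, and worker 2 gets 2 and 5. The mapping is fixed, so tasks belonging to a machine that is down stay unprocessed until it comes back or the worker count is reconfigured.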

AwesomeHunter
  • Quite a smart idea. Your approach will work as long as we assume every parallel machine is protected by a high-availability setup. It also acts as a load balancer for task handling. – Kaguya Mar 16 '21 at 05:50