
I have a process that streams records from a database, hands each record to an external server for processing, and stores the results back in the database:

Get database row from table A
Hand off to external server
Receive result
Insert database row into table B

Currently this is a single-threaded operation, and the bottleneck is the external server process, so I would like to improve performance by spreading requests across several instances of the external server process.

Get 100 database rows from table A
  For each row
    Hand off to external server 1
    Receive result
    Insert database row into table B

In parallel, get 100 database rows from table A
  For each row
    Hand off to external server 2
    Receive result
    Insert database row into table B

Problem 1: I have been investigating Java thread pools and dispatching records to the external servers that way; however, I'm not sure how to fetch records from the database as quickly as possible without the workers rejecting new tasks. Can this be done with thread pools? What architecture should be used to achieve this?

Problem 2: At present I have optimised the database inserts by using batch statements and executing them only once 2000 records have been processed. Would it be possible to adopt a similar approach within the workers?
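
For context, the insert batching currently looks roughly like this (a sketch only: error handling is omitted, and the table name, column names and ProcessedRecord type are placeholders):

    // batch the inserts and execute once 2000 records have accumulated
    PreparedStatement insert = connection.prepareStatement(
            "INSERT INTO table_b (id, result) VALUES (?, ?)");
    int pending = 0;
    for (ProcessedRecord r : processedRecords) {
        insert.setLong(1, r.getId());
        insert.setString(2, r.getResult());
        insert.addBatch();
        if (++pending == 2000) {
            insert.executeBatch();
            pending = 0;
        }
    }
    if (pending > 0) {
        insert.executeBatch();   // flush the remainder
    }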

Any help in structuring a solution to this problem would be greatly appreciated.

eageranalyst
  • Could you please explain "I'm not sure how to fetch records from the database as quickly as possible without the workers rejecting new tasks"? Is the thread pool rejecting your tasks because you generate tasks so quickly that the pool's waiting queue fills up? If so, you can make each task's batch size bigger (e.g. 1000 rows instead of 100). – James Jun 29 '14 at 00:04
  • I could increase the size of the queues, but given that I will have a fixed number of workers I may still reach a situation where the queues fill up. The only way I can think of to get around this would be to tune the program to pause the creation of new tasks for n milliseconds, by which point the workers should have worked through enough of their queues. – eageranalyst Jun 29 '14 at 09:12

1 Answer


Based on your comments, I think the key point is controlling the number of pending tasks. You have several options:

  1. Estimate the number of records in your data set, then choose a batch size that produces a reasonable number of tasks. For example, if you want to limit the pending task count to 100 and you have 100K records, use a batch size of 1K; if you have 1 million records, set the batch size to 10K.
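
     In code that batch-size choice is simple arithmetic; a quick sketch using the example figures above:

     long estimatedRecords = 100_000L;   // e.g. from a COUNT(*) on table A
     int maxPendingTasks = 100;          // ceiling on queued tasks
     int batchSize = (int) Math.max(1, estimatedRecords / maxPendingTasks);   // 1000 here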

  2. Supply your own bounded BlockingQueue to the thread pool, as sketched below. If you haven't done this before, you should study the java.util.concurrent package carefully first.
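
     A minimal sketch, assuming a fixed pool of four workers and a 100-slot queue (both numbers are placeholders). The bounded ArrayBlockingQueue caps the number of queued tasks, and CallerRunsPolicy makes the submitting thread run a task itself when the queue is full, instead of throwing RejectedExecutionException, which naturally throttles task generation (all classes are in java.util.concurrent):

     ThreadPoolExecutor threadpool = new ThreadPoolExecutor(
             4, 4,                                        // fixed number of worker threads
             0L, TimeUnit.MILLISECONDS,                   // keep-alive is irrelevant for a fixed pool
             new ArrayBlockingQueue<Runnable>(100),       // at most 100 queued tasks
             new ThreadPoolExecutor.CallerRunsPolicy());  // run in the caller instead of rejecting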

  3. Or you can use a java.util.concurrent.Semaphore, which is a simpler facility than a user-supplied queue:

    • Declare a semaphore with your pending task count limit

    Semaphore mySemaphore = new Semaphore(max_pending_task_count);

    • Since your task generation is fast, you can use a single thread to generate all tasks. In your task generating thread:

    while (hasMoreTasks()) {
        // this will block once the pending-task limit has been reached
        // (acquire() throws InterruptedException; handle or declare it as needed)
        mySemaphore.acquire();
        // generate a new task only after a permit has been acquired;
        // the task must hold a reference to the semaphore so it can release it
        Task task = new Task(..., mySemaphore);
        threadpool.submit(task);
    }
    // now that all tasks have been generated,
    // wait for them to finish
    // (you may have a better way to detect completion than polling)
    while (mySemaphore.availablePermits() < max_pending_task_count) {
        Thread.sleep(some_time);
    }
    // now go ahead and deal with the results
    • In your Task thread:
    
    public void run() {
        ...
        // when finished, release the permit; this raises the available permit
        // count by 1 and lets the generator thread produce 1 more task
        mySemaphore.release();
    }
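
    • If you'd rather avoid the polling loop, one alternative (just a sketch) is to have the generator thread block for all permits at once after the submit loop, since Semaphore.acquire(int) only returns once that many permits are free:

    // blocks until every permit is back, i.e. every submitted task has released
    mySemaphore.acquire(max_pending_task_count);
    // optionally hand the permits back if the semaphore will be reused
    mySemaphore.release(max_pending_task_count);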
James