
I have a scenario with the following steps:

  1. Using the mssql JDBC driver, I need to connect to SQL Server database table A and fetch several million rows.

  2. I need to perform several processing and parsing steps on the table data, as well as on data coming from other sources (e.g. web services).

  3. I have to write the processed data into target tables B, C, D, E, F, G on a different SQL Server database. The writes could (should?) be done in parallel.

I would like some advice on how to correctly handle point 3. I guess it is a bad idea to share the same connection between different threads for the parallel writes to the target tables. My general idea is to spawn a new thread for each target table (6 in this case) and create a separate JDBC connection for each table, so that in theory each write can be done in parallel, independently of the others.

Will this work? Suggestions for other/better ways?

revy

2 Answers


My general idea is to spawn a new thread for each target table (6 in this case) and create a different jdbc connection for each table, thus in theory each write can be done in parallel and independent from each other.

That certainly sounds like a good plan to me. I would use a connection pool such as HikariCP or DBCP to maintain multiple connections to your database server. Then you can start multiple threads, each of which requests a connection from the pool, uses it, and returns it afterwards so it can be reused.
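To make the borrow/use/return cycle concrete, here is a minimal sketch of what a connection pool does internally. This is not HikariCP or DBCP code; the `SimplePool` class and the placeholder `String` "connections" are invented for illustration — in real code the pooled objects would be `java.sql.Connection` instances obtained from `DriverManager` or a `DataSource`:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {

    // Toy pool: real pools (HikariCP, DBCP) hand out java.sql.Connection
    // objects the same way -- borrow, use, return.
    static class SimplePool<T> {
        private final BlockingQueue<T> resources = new ArrayBlockingQueue<>(16);

        SimplePool(List<T> initial) {
            resources.addAll(initial);
        }

        T borrow() throws InterruptedException { return resources.take(); }

        void giveBack(T r) { resources.add(r); }
    }

    static int runWriters() throws InterruptedException {
        // Placeholder "connections"; in real code these would come from
        // DriverManager.getConnection(...) or a DataSource
        SimplePool<String> pool = new SimplePool<>(List.of("conn-1", "conn-2", "conn-3"));

        ExecutorService executor = Executors.newFixedThreadPool(6);
        AtomicInteger writes = new AtomicInteger();

        for (int i = 0; i < 6; i++) {
            executor.submit(() -> {
                try {
                    String conn = pool.borrow();   // blocks until a connection is free
                    try {
                        writes.incrementAndGet();  // stand-in for the actual batch insert
                    } finally {
                        pool.giveBack(conn);       // always return the connection to the pool
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return writes.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed writes: " + runWriters());
    }
}
```

Note that the pool lets you run more worker tasks than there are connections: with 3 connections and 6 tasks, tasks simply block in `borrow()` until a connection is returned.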

Will this work? Suggestions for other/better ways?

It will work. One thing to consider is that 6 might not be the right number. Your server may not have the bandwidth to handle that much data at once, so you might want to try decreasing the number of threads in your pool until you find the number that gives you the best throughput. That said, if there are 6 tables then 6 may indeed be the right number, depending on how the data is partitioned on the server.

Depending on how knowledgeable you are about threads, you should check out the docs on thread pooling.

Gray
  • Thank you for the answer. I was thinking of using a thread pool with ExecutorService in which I would create long-lived threads, each one responsible for opening a connection to SQL Server and closing it at the end of the application. Is this sufficient, or do I also need a connection pool in addition to the thread pool? Also, I have some doubts about how to pass different data to each thread at each iteration of ResultSet.next(), since the data is retrieved by the main thread from the source table. Is it possible to pass custom data to a thread using ExecutorService? – revy Feb 21 '19 at 14:49
  • 1
    Yes using a thread pool with a long running connection is fine. You can still use a connection pool to help manage the connections but you are right that it's not necessary. – Gray Feb 21 '19 at 15:10
  • 1
    You can certainly pass custom data to an executor service since you can pass a `Runnable` or `Callable` that you define. The trick is if the main method needs to pass the data for table `B` through a different channel than table `C`. Then you can't use an `ExecutorService` in that manner. Then maybe you need to have 6 `BlockingQueue`s, one for each table, and main will write rows down each of the appropriate queues and on the other side the table threads would pluck off the row and send it to their table that they are working with. – Gray Feb 21 '19 at 15:12
  • In my scenario the main thread is the data producer while the threads in the pool are the consumers. The idea is to use a PreparedStatement with a predefined batch size (say, 100 rows), so each time 100 rows of data are available (for some table) the main thread takes a thread from the ExecutorService pool and passes the PreparedStatement to that thread, which executes the write. Is this feasible? – revy Feb 21 '19 at 15:22
  • The main thread doesn't "take a thread"; rather, you pass the 100 rows as a `Runnable` or `Callable` via `executorService.submit(...)`, and the next free thread will pull off the rows and work on them. Again, the problem is when a certain 100 rows need to go to a certain table. If the 100 rows can go to _any_ of the B, C, D, ... tables then this is not a problem. – Gray Feb 21 '19 at 15:58
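The batch-submission idea from the comments above can be sketched as follows. This is only an illustration, not code from the thread: the names `BatchSubmitSketch`, `produce`, and `writeBatch` are hypothetical, and `writeBatch` is a stand-in for the real `PreparedStatement` `addBatch()`/`executeBatch()` against a target table:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchSubmitSketch {

    static final int BATCH_SIZE = 100;

    // Stand-in for the real write: in the actual application this would run
    // PreparedStatement.addBatch()/executeBatch() against table B, C, ...
    static void writeBatch(List<Integer> batch, AtomicInteger written) {
        written.addAndGet(batch.size());
    }

    static int produce(int totalRows) throws InterruptedException {
        // One single-thread executor per target table guarantees that all
        // batches for a given table are written by the same dedicated thread.
        ExecutorService tableBWriter = Executors.newSingleThreadExecutor();
        AtomicInteger written = new AtomicInteger();

        List<Integer> batch = new ArrayList<>(BATCH_SIZE);
        for (int row = 0; row < totalRows; row++) {
            batch.add(row);
            if (batch.size() == BATCH_SIZE) {
                List<Integer> toWrite = batch;       // hand the full batch over
                tableBWriter.submit(() -> writeBatch(toWrite, written));
                batch = new ArrayList<>(BATCH_SIZE); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {                      // flush the final partial batch
            List<Integer> toWrite = batch;
            tableBWriter.submit(() -> writeBatch(toWrite, written));
        }

        tableBWriter.shutdown();
        tableBWriter.awaitTermination(5, TimeUnit.SECONDS);
        return written.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(produce(10023));
    }
}
```

The key point matching Gray's comment: the main thread never picks a thread itself, it just submits the full batch as a lambda, and whichever pool thread is free executes it.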

I've implemented the following solution, which uses the producer/consumer pattern with BlockingQueue and ExecutorService. The main thread (producer) instantiates a BlockingQueue for each of the worker threads (consumers), plus a volatile boolean variable "terminated" to signal to the worker threads that all data has been generated and they should terminate: exit the while loop, drain the queue, and write the remaining data over the JDBC connection. The producer produces different data for each thread through the two queues, blockingQueue1 and blockingQueue2.

Here is the simplified MainThreadProducer, which simply generates integer data for two worker threads:

// MainThreadProducer.java

import java.util.concurrent.*;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class MainThreadProducer {

    public static Logger logger = LogManager.getLogger(MainThreadProducer.class);
    public final static BlockingQueue<Integer> blockingQueue1 = new LinkedBlockingDeque<>(100);
    public final static BlockingQueue<Integer> blockingQueue2 = new LinkedBlockingDeque<>(100);

    /* signal to the worker threads that all data has been generated */
    public static volatile boolean terminated = false;


    private void run () {

        try {

            ExecutorService executor = Executors.newFixedThreadPool(2);
            Future<Integer> future1 = executor.submit(new WorkerThreadConsumer("1"));
            Future<Integer> future2 = executor.submit(new WorkerThreadConsumer("2"));

            for (int i = 0; i < 10023; ++i) {

                blockingQueue1.put(i);
                blockingQueue2.put(i*2);

            }

            executor.shutdown();
            terminated = true;

            int res1 = future1.get();
            int res2 = future2.get();

            logger.info("Total rows written (thread 1): " + res1);
            logger.info("Total rows written (thread 2): " + res2);            

        }
        catch (Exception e) {

            e.printStackTrace();
            System.exit(1);

        }

    }

    public static void main(String[] args) {

        MainThreadProducer instance = new MainThreadProducer();
        instance.run();

    }

}

Here is the WorkerThreadConsumer.java class. For this test I am creating two threads, which will write to a database DBTEST on tables TARGET_1 and TARGET_2 respectively. Each thread is instantiated with a specific String type ("1" or "2"), so it knows which BlockingQueue it needs to read data from.

// WorkerThreadConsumer.java

import java.sql.*;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;

// Configuration is an application class (assumed to be on the classpath)
// that holds the JDBC connection URL in DWH_DB_CONNECTION_URL


public class WorkerThreadConsumer implements Callable<Integer> {

    private String type;

    public WorkerThreadConsumer (String type) {

        this.type = type;        

    }

    @Override
    public Integer call() {        

        String TAG = "[THREAD_" + Thread.currentThread().getId() + "]";
        int processed = 0; // number of rows currently processed
        int batchSize = 100; // size of the batch we write to the server with the PreparedStatement

        try {
            // load jdbc driver
            Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
            MainThreadProducer.logger.info(TAG + "\tLoaded com.microsoft.sqlserver.jdbc.SQLServerDriver");

            String stub = String.format("INSERT INTO DBTEST.dbo.TARGET_%s (id) VALUES (?);", this.type);

            BlockingQueue<Integer> queue;

            switch (this.type) {
                case "1":
                    queue = MainThreadProducer.blockingQueue1;
                    break;

                case "2":
                    queue = MainThreadProducer.blockingQueue2;
                    break;

                default:
                    queue = MainThreadProducer.blockingQueue1;
            }

            try (Connection connection = DriverManager.getConnection(Configuration.DWH_DB_CONNECTION_URL);
                 PreparedStatement stmt = connection.prepareStatement(stub);) {

                connection.setAutoCommit(false);

                while (!MainThreadProducer.terminated) {

                    // poll with a timeout instead of take(): if "terminated"
                    // is set while the queue is empty, take() would block
                    // this thread forever
                    Integer data = queue.poll(100, java.util.concurrent.TimeUnit.MILLISECONDS);
                    if (data == null) {
                        continue;
                    }
                    stmt.setInt(1, data);
                    stmt.addBatch();
                    processed += 1;

                    if (processed % batchSize == 0) {
                        int[] result = stmt.executeBatch();
                        connection.commit();
                        MainThreadProducer.logger.info(TAG + "\tWritten rows count: " + result.length);
                    }

                }

                // empty queue and write
                while (!queue.isEmpty()) {
                    int data = queue.take();
                    stmt.setInt(1, data);
                    stmt.addBatch();
                    processed += 1;

                    if (processed % batchSize == 0) {
                        int[] result = stmt.executeBatch();
                        connection.commit();
                        MainThreadProducer.logger.info(TAG + "\tWritten rows count: " + result.length);
                    }
                }

                // last write in case queue size > batch size
                int[] result = stmt.executeBatch();
                connection.commit();
                MainThreadProducer.logger.info(TAG + "\tWritten rows count: " + result.length);


            }

        }
        catch (Exception e) {

            e.printStackTrace();
            System.exit(1);

        }

        return processed;
    }



}

The solution seems to work. Please let me know if you see potential issues.

revy