
I have a Spring Boot application with SQL Server as the database. One of the services the app offers is a function that inserts data into a table (a staging database). For clarity, I will describe the solution before and after the problem.

Before:

The insertion was done using Spark's bulk insert, so everything was managed by Spark; I only needed to add the connection configuration for Spark. Everything was fine with this solution: the insert took about 30 min for 7M rows.
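
For context, this is roughly what the Spark-managed write might have looked like. The question does not say which connector was used, so this is only a sketch using Spark's generic JDBC writer; the URL, table name, and credentials are placeholders.

    // Hypothetical sketch of the "before" setup: Spark writes the Dataset itself,
    // managing connections and batching internally. Connector, table and credentials are assumed.
    Properties connectionProperties = new Properties();
    connectionProperties.setProperty("user", dbUser);         // placeholder
    connectionProperties.setProperty("password", dbPassword); // placeholder

    dataset.write()
           .mode(SaveMode.Append)
           .jdbc("jdbc:sqlserver://" + server + ";databaseName=" + database,
                 "dbo.StagingTable",                           // placeholder table name
                 connectionProperties);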

After:

We need to add encryption for some columns in the table (Always Encrypted), so the bulk insert was no longer a possible solution.

The new solution was to keep using Spark, but the insertion itself is done with prepared statements. The idea is to repartition the Spark RDD (the data to be inserted) into small partitions, and each partition is inserted using a batch statement.

For example :

I have a file with 60K rows and I want to insert 1,000 rows at a time, so I repartition the input into 60 partitions and each partition is inserted as one batch.
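
To make the setup concrete, here is a minimal sketch of what the driver side might look like, assuming the insert logic shown below is wrapped in a class implementing Spark's VoidFunction<Iterator<String>>; the names (InsertPartitionFunction, rowsPerBatch, jdbcParameters) are placeholders, not the actual code.

    // Driver-side sketch (assumed names): split the input so each partition holds
    // roughly `rowsPerBatch` rows, then let each partition insert itself as one batch.
    JavaRDD<String> lines = javaSparkContext.textFile(inputPath);

    long rowCount = lines.count();
    int rowsPerBatch = 1000;
    int numPartitions = (int) Math.max(1, rowCount / rowsPerBatch);

    lines.repartition(numPartitions)
         .foreachPartition(new InsertPartitionFunction(jdbcParameters)); // invokes call(Iterator<String>)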

Below is the code that manages the insertion per partition:

    @Override
    public void call(Iterator<String> partition) {

        try {
            // One PreparedStatement per partition: the whole partition is sent as a single batch.
            PreparedStatement preparedStatement = jdbcConnection.getConnection().prepareStatement(compiledQuery.toString());
            jdbcConnection.getConnection().setAutoCommit(false);
            long start = System.currentTimeMillis();
            while (partition.hasNext()) {

                String row = partition.next();
                Object[] rowValidated = controlRow(row);

                // Bind every column; encrypted columns get an explicit SQL type/length for Always Encrypted.
                for (int index = 0; index < rowValidated.length; index++) {

                    PremiumTableFormat refColumn = getElementByIndexFromFixedSchema(index);
                    if (refColumn.isEncrypted()) {

                        if (!refColumn.getType().equalsIgnoreCase("date")) {
                            preparedStatement.setObject(index + 1, rowValidated[index], java.sql.Types.NVARCHAR, 264);
                        } else {
                            preparedStatement.setDate(index + 1, (Date) rowValidated[index]);
                        }
                    } else {
                        preparedStatement.setObject(index + 1, rowValidated[index]);
                    }
                }
                preparedStatement.addBatch();
            }
            long end = System.currentTimeMillis();
            log.info("prepare batch data take = " + ((end - start)) + " ms");

            // Send the accumulated batch to SQL Server and commit it as one transaction.
            start = System.currentTimeMillis();
            preparedStatement.executeBatch();
            jdbcConnection.getConnection().commit();
            preparedStatement.clearBatch();
            end = System.currentTimeMillis();
            log.info("total time taken to insert the batch = " + ((end - start)) + " ms");

        } catch (SQLServerException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

The problem is that executeBatch() takes too much time: 30 min for 60K rows.

I tried different batch sizes (10, 50, 100, 1,000 and 5,000 rows); the smaller sizes work better, but it is still slow (best time: 28 min for 60K rows using batches of 50).
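
For what it's worth, a batch size can also be applied inside a single partition by flushing the batch every N rows instead of repartitioning. A minimal sketch, where BATCH_SIZE is illustrative and bindRow(...) is a hypothetical helper wrapping the per-column setObject/setDate calls from the code above:

    final int BATCH_SIZE = 1000;   // illustrative value
    int pending = 0;

    while (partition.hasNext()) {
        Object[] rowValidated = controlRow(partition.next());
        bindRow(preparedStatement, rowValidated);   // hypothetical helper: same binding as above
        preparedStatement.addBatch();

        if (++pending == BATCH_SIZE) {
            preparedStatement.executeBatch();       // send this chunk to SQL Server
            jdbcConnection.getConnection().commit();
            pending = 0;
        }
    }
    if (pending > 0) {                              // flush whatever is left
        preparedStatement.executeBatch();
        jdbcConnection.getConnection().commit();
    }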

Please note:

  • controlRow(row) doesn't take much time: about 20 ms for each partition of 1,000 rows.
  • When the connection string doesn't contain the Always Encrypted parameters, the insert is faster (4.7 min).
  • Connection string with Always Encrypted (usage sketched below): `"jdbc:sqlserver://" + parameters.server + ";encrypt=true;databaseName=" + parameters.getDatabase() + ";encrypt=true;trustServerCertificate=true;columnEncryptionSetting=Enabled;"`
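
For completeness, a sketch of how that connection string might be opened; `parameters` is the question's config object (shape assumed), credentials are omitted, and the duplicated encrypt=true from the note is collapsed here.

    // Open an Always Encrypted-aware connection per the string above (sketch only).
    String url = "jdbc:sqlserver://" + parameters.server
            + ";databaseName=" + parameters.getDatabase()
            + ";encrypt=true;trustServerCertificate=true"
            + ";columnEncryptionSetting=Enabled;";

    try (Connection connection = DriverManager.getConnection(url)) {
        connection.setAutoCommit(false);
        // prepare statements and insert as in the code above
    }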

Do you think something is missing, or is this not the correct approach?

Thank you in advance.

M-BNCH
  • This is by no means an answer - but if I were in your position I'd try just running the insert statement (just `.execute()`, not `.addBatch()`), and commit every ~1000 inserts or so (committing too often _and_ creating too-large transactions both can be performance issues, so, split the difference, commit every 1000). – rzwitserloot Apr 28 '23 at 12:26
  • you could run each batch in its own thread. – siggemannen Apr 28 '23 at 12:27
  • The thing is, JDBC's various peculiarities, such as `addBatch`, exist _because_ some DB engines need code structured that way to do a thing unique to them or provide the best performance. However, that doesn't mean all JDBC drivers are _required_ to be more performant. They just need to implement it. As a consequence, sometimes just rewriting to another JDBC 'way' solves the problem. We don't have your test setup to test any of this. – rzwitserloot Apr 28 '23 at 12:27
  • @siggemannen Answering 'my code that pretty much _just_ interacts with the DB is running slowly' with: "Did you try threading?" is.. not appropriate advice, to say the least. That never works. – rzwitserloot Apr 28 '23 at 12:28
  • In general trying to answer performance questions with wild stabs in the dark is bad. Please don't do that. – rzwitserloot Apr 28 '23 at 12:28
  • Lol, the comments ARE for wild stabs in the dark @rzwitserloot, but thanks for playing. Obviously it might help, otherwise there would be no point in the built-in multi-threading of bulk imports etc. – siggemannen Apr 28 '23 at 12:31
  • @OP, another thing you can do to speed up your code is using raw SQL + a VALUES construct, batched up 1000 rows or so. The JDBC binding methods are usually slower. Of course, I don't know if the Always Encrypted thing supports that, but maybe rzwitserloot does – siggemannen Apr 28 '23 at 12:32
  • So have you tried to race your horses using this code without Always Encrypted? I don't think it would be any/much faster because although you're sending a single batch with 1,000 insert statements it's still inserting RBAR (Row-By-Agonizing Row), not a single insert with 1,000 rows. Batching only serves to reduce communication overhead (round-tripping between client and server), it doesn't make the statements execute any faster on the server. – AlwaysLearning Apr 28 '23 at 13:58
  • I tested execute() using 1000 rows: with Always Encrypted on it took 52 min, without Always Encrypted 22 min, so using addBatch() is still better. @siggemannen using Spark I'm already in parallel execution – M-BNCH Apr 28 '23 at 17:48
  • How many columns is it? And are you really in parallel? jdbcConnection.getConnection looks suspiciously like a shared connection to me, but I don't know anything about Spark – siggemannen Apr 28 '23 at 20:07
  • There are 241 columns. Yes, you're right, it was a shared connection. While Spark manages the parallel execution, the whole process was using a single connection. By fixing this, the execution time improved (now about 4 min for 60K / 44 min for 1M), but still not like before. – M-BNCH Apr 29 '23 at 17:26
  • That's not a bad improvement! But I'm still not sure why you can't use bulk import: https://learn.microsoft.com/en-us/sql/connect/jdbc/using-bulk-copy-with-the-jdbc-driver?view=sql-server-ver16#bulk-copy-with-always-encrypted-columns – siggemannen May 02 '23 at 19:07
  • @siggemannen I will check this again, but from what I remember this is only valid for copying from a file / table to another table. – M-BNCH May 03 '23 at 12:26
  • Update: I tested the new solution on the server, and the performance is now almost like before: 32 min for 7M rows. In conclusion, using multiple connections with Spark (or any multi-threaded solution) gives me a good result. I hope this will help someone, and thanks all for your time and support. – M-BNCH May 03 '23 at 12:27
  • Excellent! Glad you gave an update – siggemannen May 03 '23 at 12:51
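
Based on the fix the OP describes in the comments above (one JDBC connection per Spark partition instead of a single shared one), here is a minimal sketch of what the per-partition call might look like. insertSql, parameters and bindRow(...) are placeholders; the column binding and error handling would stay as in the question's code.

    @Override
    public void call(Iterator<String> partition) throws Exception {
        String url = "jdbc:sqlserver://" + parameters.server
                + ";databaseName=" + parameters.getDatabase()
                + ";encrypt=true;trustServerCertificate=true"
                + ";columnEncryptionSetting=Enabled;";

        // The connection is created inside call(), so every partition/task gets its own
        // connection instead of all tasks funnelling through one shared connection.
        try (Connection connection = DriverManager.getConnection(url);
             PreparedStatement preparedStatement = connection.prepareStatement(insertSql)) {

            connection.setAutoCommit(false);
            while (partition.hasNext()) {
                bindRow(preparedStatement, controlRow(partition.next())); // same binding logic as the question's code
                preparedStatement.addBatch();
            }
            preparedStatement.executeBatch();
            connection.commit();
        }
    }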

0 Answers