I have a Spring Boot application with SQL Server as the database. One of the services the app offers is a function that inserts data into a table (a staging database). For clarity, I will describe the solution before and after the problem appeared.
Before:
The insertion was done using Spark's bulk insert, so everything was managed by Spark; I only needed to add the connection configuration for Spark. Everything was fine with this solution: the insert took about 30 min for 7M rows.
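For reference, the bulk path was just Spark's standard JDBC writer; roughly something like the sketch below (the dataset and url variables, the target table name, and the getUser()/getPassword() accessors are illustrative, not the actual code):

dataset.write()
    .format("jdbc")
    .option("url", url)                       // plain connection string, no Always Encrypted
    .option("dbtable", "dbo.StagingTable")    // hypothetical target table
    .option("user", parameters.getUser())
    .option("password", parameters.getPassword())
    .option("batchsize", "10000")             // rows per JDBC batch, handled by Spark
    .mode(SaveMode.Append)
    .save();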
After:
We needed to add encryption for some columns of the table (Always Encrypted), so bulk insert was no longer a possible solution.
The new solution was to keep using Spark, but do the insertion itself with prepared statements: the idea is to repartition the Spark RDD (the data to be inserted) into small partitions, and insert each partition with one batch statement.
For example:
I have a file with 60K rows and want to insert 1000 rows at a time, so I repartition the input into 60 partitions, and each partition is inserted as one batch.
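On the driver side this looks roughly like the sketch below (the names are illustrative: rows is the JavaRDD<String> of input lines, and InsertPartitionFunction is the class whose call() is shown next):

int batchSize = 1000;
// one partition per batch, so each partition is flushed with a single executeBatch()
int numPartitions = (int) Math.max(1, rows.count() / batchSize);
rows.repartition(numPartitions)
    .foreachPartition(new InsertPartitionFunction());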
Below is the code that manages the insertion:
@Override
public void call(Iterator<String> partition) {
    Connection connection = jdbcConnection.getConnection();
    // try-with-resources so the statement is always closed, even on failure
    try (PreparedStatement preparedStatement = connection.prepareStatement(compiledQuery.toString())) {
        connection.setAutoCommit(false);
        long start = System.currentTimeMillis();
        while (partition.hasNext()) {
            String row = partition.next();
            Object[] rowValidated = controlRow(row);
            for (int index = 0; index < rowValidated.length; index++) {
                PremiumTableFormat refColumn = getElementByIndexFromFixedSchema(index);
                if (refColumn.isEncrypted()) {
                    // encrypted columns are bound with an explicit SQL type (and length for NVARCHAR)
                    if (!refColumn.getType().equalsIgnoreCase("date")) {
                        preparedStatement.setObject(index + 1, rowValidated[index], java.sql.Types.NVARCHAR, 264);
                    } else {
                        preparedStatement.setDate(index + 1, (Date) rowValidated[index]);
                    }
                } else {
                    preparedStatement.setObject(index + 1, rowValidated[index]);
                }
            }
            preparedStatement.addBatch();
        }
        long end = System.currentTimeMillis();
        log.info("prepare batch data took = " + (end - start) + " ms");

        // one executeBatch() and one commit per partition
        start = System.currentTimeMillis();
        preparedStatement.executeBatch();
        connection.commit();
        preparedStatement.clearBatch();
        end = System.currentTimeMillis();
        log.info("total time taken to insert the batch = " + (end - start) + " ms");
    } catch (SQLException e) {
        // SQLServerException extends SQLException, so both are rethrown
        // instead of being silently swallowed
        throw new RuntimeException(e);
    }
}
The problem is that executeBatch() takes too much time: 30 min for 60K rows.
I tried different batch sizes (10, 50, 100, 1000 and 5000 rows); the small ones work better but are still slow (best time: 28 min for 60K rows using batches of 50).
Please note:
- controlRow(row) doesn't take much time: about 20 ms per partition of 1000 rows.
- When the connection string doesn't contain the Always Encrypted parameters, the insert is much faster (4.7 min).
- Connection string with Always Encrypted (the same settings are sketched with a DataSource after this list):
"jdbc:sqlserver://" + parameters.server + ";databaseName=" + parameters.getDatabase() + ";encrypt=true;trustServerCertificate=true;columnEncryptionSetting=Enabled;"
Do you think something is missing, or that this isn't the correct way to do it?
Thank you in advance.