I am using the BatchedColumnProcessor of univocity-parsers to parse large CSV files. My parser settings are:
csvParserSettings.detectFormatAutomatically();
csvParserSettings.setHeaderExtractionEnabled(true);
csvParserSettings.setMaxCharsPerColumn(-1);
csvParserSettings.setColumnReorderingEnabled(true);
final RecLoCSVBatchedProcessor processor =
new RecLoCSVBatchedProcessor(batchSize, csvAccountId);
csvParserSettings.setProcessor(processor);
The code snippet where I invoke parsing is:
try (InputStream inputStream = new FileInputStream(csvLocalFilePath);
     BOMInputStream bomInputStream = new BOMInputStream(inputStream);
     Reader inputReader = new InputStreamReader(bomInputStream, StandardCharsets.UTF_8)) {
    // Rows are processed in batches by RecLoCSVBatchedProcessor
    List<String[]> rows = csvProcessor.parseAll(inputReader);
} catch (final IOException e) {
    throw new UncheckedIOException(e); // don't silently swallow parse failures
}
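(For context on the stream wrapping above: BOMInputStream from Apache Commons IO strips a leading byte-order mark so it does not leak into the first header name. A minimal stdlib-only sketch of what that wrapper does for the UTF-8 case, using only a hypothetical helper name of my own, would be:)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomStrip {

    // Wraps a stream and skips a leading UTF-8 BOM (EF BB BF) if present;
    // otherwise pushes the peeked bytes back so nothing is lost.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: restore the bytes
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints a,b
    }
}
```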
Sometimes I notice the same batch being processed twice. In the processor I have overridden the batchProcessed callback and do the required processing there.
public class RecLoCSVBatchedProcessor extends BatchedColumnProcessor {
public RecLoCSVBatchedProcessor(final int rowsPerBatch, final Long accountId) {
super(rowsPerBatch);
....
}
    @Override
    public void batchProcessed(final int rowsInThisBatch) { ... }
}
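(To make clear what I expect: for a fixed input, each batch should trigger the callback exactly once. This is a stdlib-only simplified analogue of the batch-callback pattern, not univocity itself; all class and method names here are my own illustration.)

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDemo {

    // Simplified analogue of a batched processor: rows accumulate until
    // rowsPerBatch is reached, then the batch callback fires exactly once.
    static abstract class SimpleBatchedProcessor {
        private final int rowsPerBatch;
        private final List<String[]> batch = new ArrayList<>();

        SimpleBatchedProcessor(int rowsPerBatch) {
            this.rowsPerBatch = rowsPerBatch;
        }

        void rowProcessed(String[] row) {
            batch.add(row);
            if (batch.size() == rowsPerBatch) {
                batchProcessed(batch.size());
                batch.clear();
            }
        }

        void processingEnded() {
            if (!batch.isEmpty()) {
                batchProcessed(batch.size()); // flush the final partial batch
                batch.clear();
            }
        }

        abstract void batchProcessed(int rowsInThisBatch);
    }

    // Feeds 5 rows through a batch size of 2 and records each callback.
    static List<Integer> run() {
        List<Integer> batchSizes = new ArrayList<>();
        SimpleBatchedProcessor p = new SimpleBatchedProcessor(2) {
            @Override
            void batchProcessed(int rowsInThisBatch) {
                batchSizes.add(rowsInThisBatch);
            }
        };
        for (int i = 0; i < 5; i++) {
            p.rowProcessed(new String[]{"row" + i});
        }
        p.processingEnded();
        return batchSizes;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [2, 2, 1] -- each batch exactly once
    }
}
```

With my data, however, the same batch occasionally appears to be delivered more than once.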
Is this something to do with the settings? As mentioned, it does not happen every time. It is a waste of resources and adds unnecessary processing time when the same batches are processed multiple times. Please let me know what could be wrong here.
Thanks