
I have a Camel route that reads a file from S3 and then processes the input file as follows:

  1. Parse each row into a POJO (Student) using Bindy
  2. Split the output by body()
  3. Aggregate by an attribute of the body (.semester) with a batch size of 2
  4. Invoke the persistence service to upload to DB in given batches

The problem is that with a batch size of 2 and an odd number of records, there is always one record that does not get saved.

The code provided is Kotlin, but it should not differ much from the equivalent Java code (bar the backslash in front of "\${simple expression}" and the lack of semicolons to terminate statements).

If I set the batch size to 1 then every record is saved, otherwise the last record never gets saved.

I have checked the documentation for message-processor a few times but it doesn't seem to cover this particular scenario.

I have also set [completionTimeout|completionInterval] in addition to completionSize but it does not make any difference.

Has anyone encountered this problem before?

val csvDataFormat = BindyCsvDataFormat(Student::class.java)

from("aws-s3://$student-12-bucket?amazonS3Client=#amazonS3&delay=5000")
    .log("A new Student input file has been received in S3: '\${header.CamelAwsS3BucketName}/\${header.CamelAwsS3Key}'")
    .to("direct:move-input-s3-object-to-in-progress")
    .to("direct:process-s3-file")
    .to("direct:move-input-s3-object-to-completed")
    .end()

from("direct:process-s3-file")
    .unmarshal(csvDataFormat)
    .split(body())
    .streaming()
    .parallelProcessing()
    .aggregate(simple("\${body.semester}"), GroupedBodyAggregationStrategy())
    .completionSize(2)
    .bean(persistenceService)
    .end()

With an input CSV file including seven (7) records, this is the output generated (with some added debug logging):

WARN 19540 --- [student-12-move] c.a.s.s.internal.S3AbortableInputStream  : Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
 INFO 19540 --- [student-12-move] student-workflow-main                    : A new Student input file has been received in S3: 'student-12-bucket/inbox/foo.csv'
 INFO 19540 --- [student-12-move] move-input-s3-object-to-in-progress      : Moving S3 file 'inbox/foo.csv' to 'in-progress' folder...
 INFO 19540 --- [student-12-move] student-workflow-main                    : Moved input S3 file 'in-progress/foo.csv' to 'in-progress' folder...
 INFO 19540 --- [student-12-move] pre-process-s3-file-records              : Start saving to database...
DEBUG 19540 --- [read #7 - Split] c.b.i.d.s.StudentPersistenceServiceImpl  : Saving record to database: Student(id=7, name=Student 7, semester=2nd, javaMarks=25)
DEBUG 19540 --- [read #7 - Split] c.b.i.d.s.StudentPersistenceServiceImpl  : Saving record to database: Student(id=5, name=Student 5, semester=2nd, javaMarks=81)
DEBUG 19540 --- [read #3 - Split] c.b.i.d.s.StudentPersistenceServiceImpl  : Saving record to database: Student(id=6, name=Student 6, semester=1st, javaMarks=15)
DEBUG 19540 --- [read #3 - Split] c.b.i.d.s.StudentPersistenceServiceImpl  : Saving record to database: Student(id=2, name=Student 2, semester=1st, javaMarks=62)
DEBUG 19540 --- [read #2 - Split] c.b.i.d.s.StudentPersistenceServiceImpl  : Saving record to database: Student(id=3, name=Student 3, semester=2nd, javaMarks=72)
DEBUG 19540 --- [read #2 - Split] c.b.i.d.s.StudentPersistenceServiceImpl  : Saving record to database: Student(id=1, name=Student 1, semester=2nd, javaMarks=87)
 INFO 19540 --- [student-12-move] device-group-workflow-main               : End pre-processing S3 CSV file records...
 INFO 19540 --- [student-12-move] move-input-s3-object-to-completed        : Moving S3 file 'in-progress/foo.csv' to 'completed' folder...
 INFO 19540 --- [student-12-move] device-group-workflow-main               : Moved S3 file 'in-progress/foo.csv' to 'completed' folder...
Lex Luthor
  • The completionTimeout ought to trigger the last row when it times out. It would be strange if that didn't work. – Claus Ibsen Jan 03 '19 at 12:27
  • It does exhibit the right behaviour if I replace simple("${body.semester}") for constant(true). It is probably a bug... – Lex Luthor Jan 05 '19 at 01:20
  • What version of Camel are you using? The group key whether body.semester or constant should not affect the timeout. – Claus Ibsen Jan 05 '19 at 07:28
  • I am using the following components: camel.version = 2.23.0, spring-boot.version = 2.1.1.RELEASE, kotlin.version = 1.3.10, aws-java-sdk.version = 1.11.461 – Lex Luthor Jan 05 '19 at 23:06
  • And are you sure there is not something wrong in your bean that doesn't work with 1 record only? Have you tried adding a log after the aggregate to see that it logs something when the completion timeout is triggered? If it is still a problem, you can try to build a project on GitHub so others can take a look more easily, and make it possible to run without an AWS account. – Claus Ibsen Jan 07 '19 at 20:20
  • @LexLuthor were you able to fix this? I am facing similar issue. – Sarang Dec 02 '20 at 12:09
  • I ended up converting my processing logic into a Reactive Flow using the Project Reactor library. I then subscribe to it from the top-level Camel route. Clean approach, very happy with the results. – Lex Luthor Dec 07 '20 at 22:41

1 Answer

If you need to immediately complete your message, then you can specify a completion predicate which is based on the exchange properties set by the splitter. I've not tried this, but I think

.completionPredicate( simple( "${exchangeProperty.CamelSplitComplete}" ) )

would process the last message.

My other concern is that you've set parallelProcessing in your splitter, which may mean that the messages aren't processed in order. Is it really the splitter you want the parallel processing applied to, or actually the aggregator? You don't seem to do anything with the split records except aggregate them and then process them, so it might be better to move the parallelProcessing instruction to the aggregator.

Screwtape
  • That did not solve it, but it helped me understand the cause of the problem. The issue appears to be that, when grouping by a certain attribute of the body, there may be some carry-over exchanges at the end for which none of the completion conditions will ever be true: (1) exchange completion size == 2; (2) exchangeProperty.CamelSplitComplete == true. Not sure how to solve it yet. – Lex Luthor Jan 03 '19 at 22:52
  • Maybe turn off parallel processing as you can have out of order processing with it enabled. – Claus Ibsen Jan 06 '19 at 09:51
  • I did turn off parallel processing during the split as suggested. The end result was the same, though. – Lex Luthor Jan 06 '19 at 22:09
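
The cause described in the comments under the answer (a group is left holding a record that never meets any completion condition) can be reproduced without Camel. The following is a minimal plain-Kotlin sketch, not Camel code: `aggregate`, `sampleStudents` and the trimmed-down `Student` class are hypothetical names for illustration, the records are processed sequentially, and the semester assigned to student 4 is an assumption, since it does not appear in the log above.

```kotlin
// Stand-in for the Bindy-mapped POJO; only the fields needed here.
data class Student(val id: Int, val semester: String)

/**
 * Mimics an aggregator keyed by semester. A group completes when it reaches
 * batchSize, or when the record that just joined it is the last one emitted
 * by the splitter (the CamelSplitComplete idea from the answer).
 * Returns the completed batches and the records still waiting in open groups.
 */
fun aggregate(
    records: List<Student>,
    batchSize: Int,
    flushAllAtEnd: Boolean
): Pair<List<List<Student>>, List<Student>> {
    val open = LinkedHashMap<String, MutableList<Student>>()
    val completed = mutableListOf<List<Student>>()

    records.forEachIndexed { index, student ->
        val group = open.getOrPut(student.semester) { mutableListOf() }
        group.add(student)
        val isLast = index == records.lastIndex
        // Both conditions only ever complete the group of the current record.
        if (group.size >= batchSize || isLast) {
            completed.add(group.toList())
            open.remove(student.semester)
        }
    }

    // Assumed fix: once the input is exhausted, complete every open group.
    if (flushAllAtEnd) {
        open.values.forEach { completed.add(it.toList()) }
        open.clear()
    }
    return completed to open.values.flatten()
}

// Hypothetical input: seven records, student 4's semester is assumed.
fun sampleStudents(): List<Student> = listOf(
    Student(1, "2nd"), Student(2, "1st"), Student(3, "2nd"), Student(4, "1st"),
    Student(5, "2nd"), Student(6, "1st"), Student(7, "2nd")
)
```

Calling `aggregate(sampleStudents(), 2, false)` yields three batches and leaves student 6 stranded in the "1st" group, because the last record belongs to "2nd" and so the end-of-split condition fires for the wrong group. With `flushAllAtEnd = true` the stranded record is emitted as a fourth batch of one. In Camel terms, this suggests the missing piece is a completion that applies to all open groups rather than to a single correlation key; which Camel mechanism delivers that in this setup is not established in the thread.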