
We have data streaming in on an irregular basis and in quantities that I cannot predict. I currently have the commit-interval set to 1 because we want data to be written as soon as we receive it. We sometimes get large numbers of items at once (~1000-50000 items in a second), which I would like to commit in larger chunks because writing them individually takes a while. Is there a way to set a timeout on the commit-interval?

Goal: We set the commit-interval to 10000, we get 9900 items, and after 1 second it commits the 9900 items rather than waiting until it receives 100 more.

Currently, when we set the commit-interval greater than 1, we just see data waiting to be written until it hits the amount specified by the commit-interval.

ergometer
  • Take a look at http://stackoverflow.com/q/37390602/62201. My first thought for your use case is https://en.m.wikipedia.org/wiki/Log_rotation, with persisting done afterwards – Michael Pralow Jun 09 '16 at 05:20

1 Answer


How is your data streaming in? Is it being loaded to a work table? Added to a queue? Typically you'd just drain the work table or queue with whatever commit interval performs best, then re-run the job periodically to check whether a new batch of inbound records has been received.

Either way, I would typically leverage flow control to have your job loop and just process as many records as are ready to be processed for a given time interval:

<job id="job">
    <decision id="decision" decider="decider">
        <next on="PROCESS" to="processStep" />
        <next on="DECIDE" to="decision" />
        <end on="COMPLETED" />
        <fail on="*" />
    </decision>

    <!-- next="decision" sends the flow back to the decider so the job loops -->
    <step id="processStep" next="decision">
        <!-- your step here -->
    </step>

</job>

<beans:bean id="decider" class="com.package.MyDecider"/>

Then your decider would do something like this:

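// Body of MyDecider.decide(JobExecution, StepExecution).
// The status names must match the transitions on the <decision> element.
if (maxTimeReached) {
    return FlowExecutionStatus.COMPLETED; // handled by <end on="COMPLETED" />
}

if (hasRecords) {
    return new FlowExecutionStatus("PROCESS");
} else {
    // wait X seconds here before polling again
    return new FlowExecutionStatus("DECIDE");
}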
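
To make the polling logic above concrete, here is a minimal, framework-free sketch of the same three-way decision. The class name `PollingDecision`, its constructor parameters, and the injected clock are all hypothetical and only for illustration. In a real job this logic would presumably live inside a `JobExecutionDecider` implementation, with each returned string wrapped in a `FlowExecutionStatus`.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical model of the decider's logic; not part of Spring Batch. */
final class PollingDecision {
    private final long deadlineMillis;       // when the job should stop looping
    private final long waitMillis;           // pause between polls when idle
    private final BooleanSupplier hasRecords; // assumed check for pending work
    private final LongSupplier clock;         // injected so the logic is testable

    PollingDecision(long maxRunMillis, long waitMillis,
                    BooleanSupplier hasRecords, LongSupplier clock) {
        this.clock = clock;
        this.deadlineMillis = clock.getAsLong() + maxRunMillis;
        this.waitMillis = waitMillis;
        this.hasRecords = hasRecords;
    }

    /** Returns "COMPLETED", "PROCESS", or "DECIDE", matching the XML transitions. */
    String decide() {
        if (clock.getAsLong() >= deadlineMillis) {
            return "COMPLETED";              // maximum run time reached
        }
        if (hasRecords.getAsBoolean()) {
            return "PROCESS";                // work is waiting: run the step
        }
        try {
            Thread.sleep(waitMillis);        // nothing to do: wait, then re-check
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "COMPLETED";              // stop polling if interrupted
        }
        return "DECIDE";
    }
}
```

For example, `new PollingDecision(60_000, 1_000, () -> queueHasItems(), System::currentTimeMillis)` would poll once a second for up to a minute, where `queueHasItems()` stands in for whatever check fits your work table, queue, or folder.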
Dean Clark
  • Currently we receive several tar balls a minute (file1.tar.gz) and each tar ball contains 100s-1000s of items. Each item in the tar ball launches a new job. Some of the files in the tar ball are unnecessary and are not committed to the database, but we cannot tell which until we have opened them. There are also cases where we go for hours without receiving new files. – ergometer Jun 13 '16 at 16:15
  • Then your "processStep" step here would be a partitioned step that picks up each of the files that just landed. Or you could add a small tasklet step before it that unpacks the tar balls ahead of your partitioned step. Either way, a decider will work nicely to peek in your folder to see if any files are waiting to be processed. – Dean Clark Jun 13 '16 at 17:59