
I have a huge list of reports loaded into a partitioned chunk step. Each report is then processed to generate an individual output report. But if I load all 50k reports in the partition step, it overloads the server and everything slows down. Instead, I would prefer the partition step to load 3k reports, process them, then load the next 3k reports into the partition step, and continue like that until all 50k reports have been processed.

    <step id="genReport" next="fileTransfer">
        <chunk  item-count="1000">
            <reader ref="Reader" >
            </reader>
            <writer
                ref="Writer" >
            </writer>
        </chunk>
      <partition>
            <mapper ref="Mapper">
                <properties >
                    <property name="threadCount" value="#{jobProperties['threadCount']}"/>
                    <property name="threadNumber" value="#{partitionPlan['threadNumber']}"/>
                </properties>
            </mapper>
      </partition>
    </step> 
public PartitionPlan mapPartitions() {
        PartitionPlanImpl partitionPlan = new PartitionPlanImpl();
        int numberOfPartitions = //dao call to load the reports count
        partitionPlan.setThreads(getThreadCount());
        partitionPlan.setPartitions(numberOfPartitions); // numberOfPartitions comes from the database and is huge, around 20k to 40k
        Properties[] props = new Properties[numberOfPartitions];

        for (int idx = 0; idx < numberOfPartitions; idx++) {
            Properties threadProperties = new Properties();
            threadProperties.setProperty("threadNumber", idx + "");
            GAHReportListData gahRptListData = gahReportListManager.getPageToProcess(); //Data pulled from PriorityBlockingQueue 
            String dynSqlId = gahRptListData.getDynSqlId(); 

            threadProperties.setProperty("sqlId", dynSqlId);
            threadProperties.setProperty("outFile", fileName);

            props[idx] = threadProperties;
        }
        partitionPlan.setPartitionProperties(props);
        return partitionPlan;
    }

Once the partition mapper has processed 3k reports, it should check for the next available list. If one is available, the partition should be reset with the next set of 3k reports to process.

user3540722
  • If I understood this right, you're launching 50,000 partitions this way. I'd suggest launching some fixed number of partitions (maybe match however many threads you decide on) and give each partition a list of reports to process through (a list of sqlIds I guess). All you're doing by having all these partitions is creating a lot of Java objects to manage it all. It isn't going to run any faster than the number of threads you have working through the list of reports. Not really an answer...just an opinion. – DFollis Jun 19 '19 at 19:09
  • Each sqlId needs to be processed separately, so passing a list to a thread may not be the right way in my case. Is there any way to reset the partition mapper? When the created partitions are done, go ahead and create another partition mapper for the same step. – user3540722 Jun 19 '19 at 20:21
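
To make the fixed-partition idea from the comment above concrete, here is a minimal sketch of a mapPartitions() that creates one partition per thread and hands each partition a comma-separated list of sqlIds instead of a single one. The "sqlIds" property name and the round-robin bookkeeping are additions for illustration, not part of the original code, and the sketch reuses gahReportListManager.getPageToProcess() from the question.

    public PartitionPlan mapPartitions() {
        PartitionPlanImpl partitionPlan = new PartitionPlanImpl();
        int threadCount = getThreadCount();
        int totalReports = //dao call to load the reports count

        // Fixed number of partitions, one per thread, instead of one per report.
        partitionPlan.setThreads(threadCount);
        partitionPlan.setPartitions(threadCount);

        StringBuilder[] sqlIdLists = new StringBuilder[threadCount];
        for (int idx = 0; idx < threadCount; idx++) {
            sqlIdLists[idx] = new StringBuilder();
        }

        // Round-robin every report's sqlId across the fixed set of partitions.
        for (int i = 0; i < totalReports; i++) {
            GAHReportListData gahRptListData = gahReportListManager.getPageToProcess();
            StringBuilder target = sqlIdLists[i % threadCount];
            if (target.length() > 0) {
                target.append(',');
            }
            target.append(gahRptListData.getDynSqlId());
        }

        Properties[] props = new Properties[threadCount];
        for (int idx = 0; idx < threadCount; idx++) {
            Properties threadProperties = new Properties();
            threadProperties.setProperty("threadNumber", Integer.toString(idx));
            threadProperties.setProperty("sqlIds", sqlIdLists[idx].toString());
            threadProperties.setProperty("outFile", fileName);
            props[idx] = threadProperties;
        }
        partitionPlan.setPartitionProperties(props);
        return partitionPlan;
    }

Each partition's reader would then split its sqlIds string and work through the reports one after another on its own thread, which is essentially what the answer below describes.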

1 Answer


There's no way to reset the partition. When all the partitions defined by the partitionMapper are done, the step is over. You could have a second partitioned step that's just like the first one I guess (and a third, and a fourth) until you get through everything. That's messy. And you can't loop back in JSL and re-execute the same step again.
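
Just to illustrate that shape (not to recommend it), the chaining would look roughly like this in JSL; the genReport2 id is made up here, and each step repeats the same chunk/partition elements as the step in the question:

    <step id="genReport" next="genReport2">
        <chunk item-count="1000">
            <reader ref="Reader"/>
            <writer ref="Writer"/>
        </chunk>
        <partition>
            <mapper ref="Mapper"/>
        </partition>
    </step>
    <step id="genReport2" next="fileTransfer">
        <chunk item-count="1000">
            <reader ref="Reader"/>
            <writer ref="Writer"/>
        </chunk>
        <partition>
            <mapper ref="Mapper"/>
        </partition>
    </step>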

You could have a split/flow that ran multiples of these steps concurrently, but you can't dynamically set how many flows there are; that's fixed in the JSL. And you'd end up with more concurrency than your environment could probably handle.
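
For reference, a split with a fixed number of flows looks something like the sketch below; the split, flow, and step ids are invented, and each flow simply repeats a partitioned step like the one in the question:

    <split id="genReportSplit" next="fileTransfer">
        <flow id="flow1">
            <step id="genReportA">
                <chunk item-count="1000">
                    <reader ref="Reader"/>
                    <writer ref="Writer"/>
                </chunk>
                <partition>
                    <mapper ref="Mapper"/>
                </partition>
            </step>
        </flow>
        <flow id="flow2">
            <step id="genReportB">
                <chunk item-count="1000">
                    <reader ref="Reader"/>
                    <writer ref="Writer"/>
                </chunk>
                <partition>
                    <mapper ref="Mapper"/>
                </partition>
            </step>
        </flow>
    </split>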

I assume your chunk reader/processor/writer are iterating through the results of the one sqlId that is assigned to the partition now. To work through a list of sqlIds, you'd need a way to tell when one finished and the next one started within the same chunk loop. The reader could manage the list and would know when the transitions happen. You'd probably need a signal to the writer that a chunk end was the end of one report and that it should move on to the next one. You'd probably want a custom checkpoint algorithm for that, so you can be sure to checkpoint at the end of a report rather than hope you hit a checkpoint when each sqlId runs out of records to process.
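
A rough sketch of that approach, assuming the partition plan passes a comma-separated sqlIds property (as discussed in the comments); the reader and checkpoint-algorithm names, the transient-user-data handshake between them, and the loadRows() hook are all invented for illustration, and the two artifacts are shown together for brevity:

    import java.io.Serializable;
    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.Iterator;

    import javax.batch.api.BatchProperty;
    import javax.batch.api.chunk.AbstractItemReader;
    import javax.batch.api.chunk.CheckpointAlgorithm;
    import javax.batch.runtime.context.StepContext;
    import javax.inject.Inject;
    import javax.inject.Named;

    @Named
    public class MultiSqlIdReader extends AbstractItemReader {

        @Inject
        @BatchProperty(name = "sqlIds")
        private String sqlIds;               // e.g. "101,205,340" from the partition plan

        @Inject
        private StepContext stepContext;

        private Deque<String> pending;       // sqlIds still waiting to be processed
        private Iterator<Object> current;    // rows of the report currently being read

        @Override
        public void open(Serializable checkpoint) {
            pending = new ArrayDeque<>(Arrays.asList(sqlIds.split(",")));
        }

        @Override
        public Object readItem() {
            while (current == null || !current.hasNext()) {
                if (pending.isEmpty()) {
                    return null;             // no more reports left in this partition
                }
                current = loadRows(pending.poll());
            }
            Object row = current.next();
            if (!current.hasNext()) {
                // Last row of this report: ask for a checkpoint at this chunk end
                // (the writer can watch the same flag to close the current output).
                stepContext.setTransientUserData(Boolean.TRUE);
            }
            return row;
        }

        // Hypothetical DAO hook; the real query for one sqlId would live here.
        private Iterator<Object> loadRows(String sqlId) {
            throw new UnsupportedOperationException("wire up the report query for " + sqlId);
        }
    }

    @Named
    public class ReportBoundaryCheckpointAlgorithm implements CheckpointAlgorithm {

        @Inject
        private StepContext stepContext;

        @Override
        public int checkpointTimeout() {
            return 0;                        // no special transaction timeout
        }

        @Override
        public void beginCheckpoint() {
            // nothing to do at the start of a chunk
        }

        @Override
        public boolean isReadyToCheckpoint() {
            // Commit the chunk only once the reader has marked the end of a report.
            return Boolean.TRUE.equals(stepContext.getTransientUserData());
        }

        @Override
        public void endCheckpoint() {
            stepContext.setTransientUserData(null);   // reset for the next report
        }
    }

The chunk would then be configured with checkpoint-policy="custom" and a <checkpoint-algorithm ref="ReportBoundaryCheckpointAlgorithm"/> element, so the algorithm rather than item-count decides where chunks end, and the writer can watch the same transient user data to know when to close one report and start the next.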

I'm putting this in as an answer instead of another comment because it seems the answer to the question asked here is 'no'. The rest is just discussion about possible alternative approaches.

DFollis