
I had a requirement to process a file containing 1 million records and save them in a Redis cache. I was supposed to use a Redis pipeline, but I couldn't find any information on it. Here was my question: Question

So I decided to use multithreading via the executor framework. I am new to multithreading. Here is my code:

@Async
public void createSubscribersAsync(Subscription subscription, MultipartFile file)
        throws EntityNotFoundException, InterruptedException, ExecutionException, TimeoutException {

    ExecutorService executorService = Executors.newFixedThreadPool(8);
    Collection<Callable<String>> callables = new ArrayList<>();

    List<Subscriber> cache = new ArrayList<>();
    int batchSize = defaultBatchSize.intValue();

    while ((line = br.readLine()) != null) {
        try {
            Subscriber subscriber = createSubscriber(subscription, line);
            cache.add(subscriber);
            if (cache.size() >= batchSize) {
                IntStream.rangeClosed(1, 8).forEach(i ->
                        callables.add(createCallable(cache, subscription.getSubscriptionId())));
            }
        } catch (InvalidSubscriberDataException e) {
            invalidRows.add(line + ":" + e.getMessage());
            invalidCount++;
        }
    }
    List<Future<String>> taskFutureList = executorService.invokeAll(callables);
    for (Future<String> future : taskFutureList) {
        String value = future.get(4, TimeUnit.SECONDS);
        System.out.println(String.format("TaskFuture returned value %s", value));
    }
}

private Callable<String> createCallable(List<Subscriber> cache, String subscriptionId) {

    return new Callable<String>() {

        public String call() throws Exception {
            System.out.println(String.format("starting expensive task thread %s", Thread.currentThread().getName()));
            processSubscribers(cache, subscriptionId);
            System.out.println(String.format("finished expensive task thread %s", Thread.currentThread().getName()));
            return "Finish Thread:" + Thread.currentThread().getName();
        }
    };
}

private void processSubscribers(List<Subscriber> cache, String subscriptionId) {
    subscriberRedisRepository.saveAll(cache);
    cache.clear();
}

The idea here is to split the file into batches and save each batch using a thread. I created a pool of 8 threads.

Is this a correct way to use the executor framework? If not, could you please help me out? I appreciate the help.

pan1490
  • "Is this a correct way to implement executor framework?" You tell us. Does it work? If so, it's correct. – Michael Sep 24 '19 at 12:09
  • This will not work as you intend; you are passing the same list into 8 callables while modifying that list. Not sure what the batchSize is meant to achieve: once your cache reaches batchSize it never shrinks, and adding another subscriber creates another batch. – Martin'sRun Sep 24 '19 at 12:19
  • @Firefly Oops.. the same list to 8 callables? That's not what I wanted. What I actually want is parallel processing of the 1 million record file, with a batch size of 1k. I want each thread to process a new batch of 1k records so as to improve performance. How should I modify my code to achieve that? – pan1490 Sep 24 '19 at 12:29
  • Is there any particular reason to choose 8 as the size for your threadpool? That seems like an oddly specific number. – Martin'sRun Sep 24 '19 at 14:14
  • @Martin'sRun Not as such. My main intent is that the threads should divide the work among themselves and improve performance. Right now it's taking more than an hour to process 1 million records. – pan1490 Sep 25 '19 at 05:42
  • Check if the answer works for you; you can change the batch size or pool size depending on need. – Martin'sRun Sep 25 '19 at 08:04

1 Answer


Quick modifications to your current code to achieve the ask:

In your while loop, once the current cache reaches the batch size, create a callable passing in the current cache, then reset the cache by creating a new list and assigning it as cache.

You are creating a list of callables and submitting them all at once as a batch; why not submit each callable right after creating it? That way, already-read records start being written to Redis while your main thread continues reading the file.

List<Future<String>> taskFutureList = new LinkedList<Future<String>>();
while ((line = br.readLine()) != null) {
    try {
        Subscriber subscriber = createSubscriber(subscription, line);
        cache.add(subscriber);
        if (cache.size() >= batchSize) {
            taskFutureList.add(executorService.submit(createCallable(cache, subscription.getSubscriptionId())));
            cache = new ArrayList<>(); // reset: never reuse the list already handed to a callable
        }
    } catch (InvalidSubscriberDataException e) {
        invalidRows.add(line + ":" + e.getMessage());
        invalidCount++;
    }
}
// submit the last batch, which may be smaller than batchSize
if (!cache.isEmpty()) {
    taskFutureList.add(executorService.submit(createCallable(cache, subscription.getSubscriptionId())));
}

You do not have to store a separate list of callables.
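For reference, here is a minimal, self-contained sketch of the whole pattern: submit a callable per batch, hand each callable its own list, and wait on the futures without a timeout. Subscriber, the Redis repository, and the file reader are not reproduced here; plain strings and a counting saveAll stub stand in for them, so the names are illustrative rather than the asker's actual API.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchLoader {

    static final int BATCH_SIZE = 1000;

    // Stub standing in for subscriberRedisRepository.saveAll(batch);
    // returns the batch size so we can verify nothing was lost.
    static int saveAll(List<String> batch) {
        return batch.size();
    }

    static int load(Iterable<String> lines) throws Exception {
        ExecutorService executorService = Executors.newFixedThreadPool(8);
        List<Future<Integer>> futures = new LinkedList<>();
        List<String> cache = new ArrayList<>();

        for (String line : lines) {
            cache.add(line);
            if (cache.size() >= BATCH_SIZE) {
                final List<String> batch = cache;          // hand the full batch to the task
                futures.add(executorService.submit(() -> saveAll(batch)));
                cache = new ArrayList<>();                 // fresh list; never mutate a submitted batch
            }
        }
        if (!cache.isEmpty()) {                            // last batch may be smaller than BATCH_SIZE
            final List<String> batch = cache;
            futures.add(executorService.submit(() -> saveAll(batch)));
        }

        int total = 0;
        for (Future<Integer> f : futures) {
            total += f.get();                              // untimed get: wait for every batch to finish
        }
        executorService.shutdown();
        return total;
    }
}
```

The `final List<String> batch = cache;` lines matter: lambdas can only capture effectively final locals, and reassigning `cache` afterwards is what guarantees each task owns its list exclusively.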

Martin'sRun
  • I ran this. I found that once all 8 threads have finished their work on the first batch, they don't pick up more work. I mean, once a thread has finished its work it should take the next batch of 1000 subscribers and continue until all 1 million subscribers are processed, which is not happening right now. – pan1490 Sep 25 '19 at 09:40
  • @pan1490 Are you doing "future.get()" over all futures in taskFutureList as in the original code? – Martin'sRun Sep 25 '19 at 09:46
  • Yes, I'm calling "future.get()" for all futures – pan1490 Sep 25 '19 at 09:51
  • You should remove the future.get timeout if you intend to wait for the threads to complete: could you make it future.get() instead of future.get(4, TimeUnit.SECONDS)? Unless you don't want to wait for the threads to complete, that is. – Martin'sRun Sep 25 '19 at 10:10
  • Sure, I will give it a try – pan1490 Sep 25 '19 at 10:11
  • I'm seeing some weird behavior. Out of 1 million records, around 970,000 get processed in a minute or two, but the remaining 30k records then take more than 50 minutes. – pan1490 Sep 26 '19 at 07:45
  • Appreciate your response. Accepting your answer. – pan1490 Nov 07 '19 at 06:40