Java Thread Worker Group performance vs Embedded Looping performance

Question

I was recently doing some java performance testing when I ran this test and was absolutely shocked. I wanted to test and see what kind of performance difference I would get by doing statistics in worker group threads... that is when I got this really surprising results.

Here goes the code of the test:

import org.joda.time.DateTime;
import org.joda.time.Interval;

import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.*;

/**
 * Created by siraj on 1/2/16.
 */
public class WorkerPoolTest {
    int SAMPLE_LIMIT = 1000;
    DecimalFormat df = new DecimalFormat("#.####");

    public static void main(String[] args){

        int nTestElements = 100000;

        System.out.println("\tLinear\t\t\tNon-Linear");
        for (int i = 0;i<25;i++){
//            System.out.println("Linear test " + (i+1));
            System.out.print((i + 1));
            new WorkerPoolTest(false, nTestElements, false);
//            System.out.println("Non-linear test " + (i+1));
            new WorkerPoolTest(true, nTestElements, false);
            System.out.println();
        }


        System.out.println("Done test");
    }

    WorkerPoolTest(boolean useWorkerThreads, int testLimit, boolean outPutSampleResults){
        DateTime start = new DateTime();
//        System.out.println(start);
        startWorkerThreads(useWorkerThreads, testLimit, outPutSampleResults);
        DateTime end = new DateTime();
//        System.out.println(end);
        System.out.print("\t " +
                df.format( ((double) (new Interval(start, end).toDurationMillis()) /1000) ) + "\t\t");
    }

    private void startWorkerThreads(boolean userWorkerThreads, int testLimit, boolean outPutSampleResults){
        ArrayList<WDataObject> data = new ArrayList<>();

        if (userWorkerThreads){

            try {
                // do fast test
                ExecutorService pool = Executors.newFixedThreadPool(6);
                int nSeries = 2;
                Set<Future<WDataObject>> set = new HashSet<>();
                for (int i = 1; i <= testLimit; i ++){
                    Callable worker = new Worker(i);
                    Future<WDataObject> future = pool.submit(worker);
                    set.add(future);
                }
                for (Future<WDataObject> wdo : set){
                        data.add(wdo.get());
                }
                Collections.sort(data);
                if (outPutSampleResults)
                    for (WDataObject ob: data)
                    {
                        System.out.println(ob.toString());
                    }
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        }else{
            // do linear test.

            for (int i = 1; i <= testLimit; i ++){
                WDataObject ob = new WDataObject(i);
                for (int s = 1; s <= SAMPLE_LIMIT; s++){
                    ob.dataList.add((double)i / (double)s);
                }
                data.add(ob);
            }
            if (outPutSampleResults)
                for (WDataObject ob: data)
                {
                    System.out.println(ob.toString());
                }
        }
    }

    class Worker implements Callable{
        int i;
        Worker(int i){
            this.i = i;
        }

        @Override
        public WDataObject call() throws Exception {
            WDataObject ob = new WDataObject(i);
            for (int s = 1; s <= SAMPLE_LIMIT; s++){
                ob.dataList.add((double)i / (double)s);
            }
            return ob;
        }
    }

    class WDataObject implements Comparable<WDataObject>{
        private final int id;

        WDataObject(int id){
            this.id = id;
        }

        ArrayList<Double> dataList = new ArrayList<>();

        public Integer getID(){
            return id;
        }

        public int getId(){
            return id;
        }

        public String toString(){
            String result = "";
            for (double data: dataList) {
                result += df.format(data) + ",";
            }
            return result.substring(0, result.length()-1);
        }

        @Override
        public int compareTo(WDataObject o) {
            return getID().compareTo(o.getID());
        }
    }
}

And here goes a sample of the output of running this program...

    Linear      |   Non-Linear
1    45.735     |    15.043     
2    24.732     |    16.559     
3    15.666     |    17.553     
4    18.068     |    17.154     
5    16.446     |    19.036     
6    17.912     |    18.051     
7    16.093     |    17.618     
8    13.185     |    17.2       
9    19.961     |    26.235     
10   16.809     |    17.815     
11   15.809     |    18.098     
12   18.45      |    19.265

How could this be when the linear calculation model is using a single thread? Also, I ran this test and watched my system monitor and noticed that while running the single embedded loop that all of my computers cores where being used at maximum strength. What's going on here? Why is it that the linear calculation algorithm gets faster with subsequent iterations and why does it sometimes out preform the threaded nonlinear version of the same job?

This code sample uses Joda Time for time stamping.

Also, I am having a hard time putting tab spaces in with this editor, the results used tab spaces. You can see it in the code.

do you think that 1000 is a big number? is it enough to do "benchmarking"? — Sleiman Jneidi, Jan 03 '16 at 00:02
All these objects that you're creating must be garbage collected. Your non-linear sorts the results whereas the linear doesn't. You're measuring the time it takes to print the result on the screen, which is a pure IO, synchronized operation. Your toString() method creates thousands of string objects that need to be copied and collected... It's hard to know what you're trying to measure here, but there are so many unrelated things you're measuring that the benchmark is not reliable. — JB Nizet, Jan 03 '16 at 00:02
@JB Nizet, see the code I have posted, getting same results. I have removed the redundant parts. — 11thdimension, Jan 03 '16 at 01:13
@Sleiman Jneidi, he is submitting 100000 tasks, do you think that's good enough? Even if it were 1000 tasks, that would be good enough if it succeeded in keeping CPU 100% utilized for the entire duration. — 11thdimension, Jan 03 '16 at 01:16
I came across this problem reading in data, doing statistics and taking the resulting statistics and saving them. The calls to the toString methods and the usage of the print statements do not occur in the test. I was expecting to see the worker threads complete their operations significantly faster than the single loop arithmetic. But that wasn't the case... not to mention that the first loop is always a lot slower than any of the others. Also, when I ran the single arithmetic loop, all 8 of my cores where at max load... not what I expected from a single thread loop. — Siraj, Jan 03 '16 at 01:36

score 1 · Accepted Answer · answered Jan 03 '16 at 03:52

What your test really measures is ... Object Allocation performance.

Every time you do an ob.dataList.add((double) i / (double) s);, you are auto-boxing, and you're creating a new Double object. And because you are adding this to a list that escapes local scope, the HotSpot compiler cannot do stack allocation as an optimization. So it has to allocate on the heap, which is a relatively expensive operation, that requires some coordination between threads, so it reduces your multi-threading performance.

Step 1 to make your algorithm more real-world: replace your ArrayList<Double> dataList = new ArrayList<>(); with:

double[] dataList = new double[SAMPLE_LIMIT];

After that, your "non-linear" version outperforms the linear one consistently by a factor 2.

Secondly, a division is an incredibly cheap operation so in any case you're mainly measuring memory writes and your memory bus throughput is limited, no matter how many threads you use.

If you replace your current code with something like this:

double sum = 0;
for (int s = 1; s <= SAMPLE_LIMIT; s++) {
    sum += (double) i / (double) s;
}
ob.dataList[0] = sum;

then you'll find that your non-linear version outperforms your linear one by a factor 4 to 6, which is what you expect with a thread pool with a fixed size of 6.

That was a great observation, I knew that array lists are obviously slower than arrays, but what I had not done was thing that it would be better to just pass an array into the function instead creating an array list. This would give me the ability of having an array of arbitrary length while at the same time not taxing performance with the use of arraylists. Thank you for explaining the stack allocation. That was really a great explanation. — Siraj, Jan 03 '16 at 04:47
I'm still left wondering, however, as to why the linear loop appears to be running on multiple cores? Even after updating the source as you suggested the loop still out preformed the distributed approach because the loop appears to be running on multiple cores. Does java have some built in feature that distributes loops? After the test, and taking the average of the completion time, I found that the loop had .3% faster completion time. — Siraj, Jan 03 '16 at 14:18
@Erwin Boldwidt Shouldn't this `measures is ... Object Allocation performance` be true for single threaded and multithread programs both? How does parallel memory allocation becomes 3 to 4 times slower than sequential allocation? Heap allocation will be done in both the cases, so that doesn't explain the difference. Only thing that made a difference was autoboxing/unboxing. Also can you please provide a link to `allocate on the heap... that requires some coordination between threads` — 11thdimension, Jan 03 '16 at 22:05

11thdimension · Answer 2 · 2016-01-03T05:27:30.197

This is not an answer to your question but just the confirmation that I'm getting the same results.

I have removed the redundant code and measured the exact same code execution time in both cases. Results don't change in multiple runs and when the order is reversed which rules out the garbage collection or any OS related CPU spikes during tests.

Here's modified code.

import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TestPerf {
    int SAMPLE_LIMIT = 10000;
    DecimalFormat df = new DecimalFormat("#.####");
    public static final int TEST_COUNT = 10;

    public static void main(String[] args) {
        TestPerf main = new TestPerf();
        int nTestElements = 10000;

        System.out.println("\tLinear\t\t\t\tNon-Linear");
        for (int i = 0; i < TEST_COUNT; i++) {
            System.out.print((i + 1));
            main.startWorkerThreads(false, nTestElements, false);
            main.startWorkerThreads(true, nTestElements, false);
            System.out.println("");
        }

        System.out.println("Reversed tests");
        System.out.println("\tNon Linear\t\t\t\tLinear");
        for (int i = 0; i < TEST_COUNT; i++) {
            System.out.print((i + 1));
            main.startWorkerThreads(true, nTestElements, false);
            main.startWorkerThreads(false, nTestElements, false);
            System.out.println("");
        }

        System.out.println("Done test");
    }

    private void startWorkerThreads(boolean userWorkerThreads, int testLimit, boolean outPutSampleResults) {

        if (userWorkerThreads) {

            try {
                // do fast test
                ExecutorService pool = Executors.newFixedThreadPool(6);

                Set<Future<Long>> futureSet = new HashSet<Future<Long>>();

                for (int i = 1; i <= testLimit; i++) {
                    Callable<Long> worker = new Worker(i);
                    futureSet.add(pool.submit(worker));
                }

                pool.shutdown();
                pool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);

                //checking futures after all have returned, don't want to wait on each
                long executionTime = 0;
                for(Future<Long> future : futureSet) {
                    executionTime += future.get();
                }

                System.out.printf("\tnon linear = %f\t", (executionTime / 1e9));
            } catch (InterruptedException e) {
                e.printStackTrace();
            } catch (ExecutionException e) {
                e.printStackTrace();
            }
        } else {
            // do linear test.

            long timeDelta = 0;
            for (int i = 1; i <= testLimit; i++) {
                long startTime = System.nanoTime();

                WDataObject ob = new WDataObject(i);
                for (int s = 1; s <= SAMPLE_LIMIT; s++) {
                    ob.dataList.add((double) i / (double) s);
                }               

                long endTime = System.nanoTime();
                timeDelta += (endTime - startTime);
            }
            System.out.printf("\tlinear = %f\t",(timeDelta / 1e9));
        }
    }

    class Worker implements Callable<Long> {
        int i;
        Worker(int i) {
            this.i = i;
        }

        @Override
        public Long call() throws Exception {
            long startTime = System.nanoTime();

            WDataObject ob = new WDataObject(i);
            for (int s = 1; s <= SAMPLE_LIMIT; s++) {
                ob.dataList.add((double) i / (double) s);
            }
            long endTime = System.nanoTime();

            return (endTime - startTime);
        }

    }

    class WDataObject implements Comparable<WDataObject> {
        private final int id;
        ArrayList<Double> dataList = new ArrayList<>();

        WDataObject(int id) {
            this.id = id;
        }

        public Integer getID() {
            return id;
        }

        public int getId() {
            return id;
        }

        public String toString() {
            String result = "";
            for (double data : dataList) {
                result += df.format(data) + ",";
            }
            return result.substring(0, result.length() - 1);
        }

        @Override
        public int compareTo(WDataObject o) {
            return getID().compareTo(o.getID());
        }
    }
}

Note: This test was run with 10000 tasks submitted to the executor, as for more than that it just takes a lot of time but I doubt that results will change.

Output

    Linear              Non-Linear
1   linear = 1.261564       non linear = 3.831899   
2   linear = 1.098359       non linear = 3.677221   
3   linear = 1.315108       non linear = 3.542210   
4   linear = 1.267752       non linear = 3.415670   
5   linear = 1.249890       non linear = 3.387447   
6   linear = 1.297200       non linear = 4.244616   
7   linear = 1.328806       non linear = 4.821367   
8   linear = 1.362364       non linear = 4.582840   
9   linear = 1.392996       non linear = 5.169028   
10  linear = 1.319172       non linear = 4.734327   
Reversed tests
    Non Linear              Linear
1   non linear = 5.033875       linear = 1.329440   
2   non linear = 4.547303       linear = 1.291331   
3   non linear = 4.613079       linear = 1.353841   
4   non linear = 4.618064       linear = 1.314747   
5   non linear = 4.580547       linear = 1.313031   
6   non linear = 5.371241       linear = 1.338901   
7   non linear = 5.194418       linear = 1.361951   
8   non linear = 4.521603       linear = 1.251608   
9   non linear = 4.474672       linear = 1.304659   
10  non linear = 4.580605       linear = 1.349442   
Done test

Edit **

Just Confirming the findings by @Erwin Boldwidt

Here's the code with double[] array instead of an ArrayList

import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TestPerf {
    int SAMPLE_LIMIT = 10000;
    DecimalFormat df = new DecimalFormat("#.####");
    public static final int TEST_COUNT = 10;

    public static void main(String[] args) {
        TestPerf main = new TestPerf();
        int nTestElements = 10000;

        System.out.println("\tLinear\t\t\t\tNon-Linear");
        for (int i = 0; i < TEST_COUNT; i++) {
            System.out.print((i + 1));
            main.startWorkerThreads(false, nTestElements, false);
            main.startWorkerThreads(true, nTestElements, false);
            System.out.println("");
        }

        System.out.println("Reversed tests");
        System.out.println("\tNon Linear\t\t\t\tLinear");
        for (int i = 0; i < TEST_COUNT; i++) {
            System.out.print((i + 1));
            main.startWorkerThreads(true, nTestElements, false);
            main.startWorkerThreads(false, nTestElements, false);
            System.out.println("");
        }

        System.out.println("Done test");
    }

    private void startWorkerThreads(boolean userWorkerThreads, int testLimit, boolean outPutSampleResults) {

        if (userWorkerThreads) {

            try {
                // do fast test
                ExecutorService pool = Executors.newFixedThreadPool(6);

                Set<Future<Long>> futureSet = new HashSet<Future<Long>>();

                for (int i = 1; i <= testLimit; i++) {
                    Callable<Long> worker = new Worker(i);
                    futureSet.add(pool.submit(worker));
                }

                pool.shutdown();
                pool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);

                //checking futures after all have returned, don't want to wait on each
                long executionTime = 0;
                for(Future<Long> future : futureSet) {
                    executionTime += future.get();
                }

                System.out.printf("\tnon linear = %f\t", (executionTime / 1e9));
            } catch (InterruptedException e) {
                e.printStackTrace();
            } catch (ExecutionException e) {
                e.printStackTrace();
            }
        } else {
            // do linear test.

            long timeDelta = 0;
            for (int i = 1; i <= testLimit; i++) {
                long startTime = System.nanoTime();

                WDataObject ob = new WDataObject(i);
                for (int s = 1; s <= SAMPLE_LIMIT; s++) {
                    ob.dataList[s-1] = (double) i / (double)s;
                }

                long endTime = System.nanoTime();
                timeDelta += (endTime - startTime);
            }
            System.out.printf("\tlinear = %f\t",(timeDelta / 1e9));
        }
    }

    class Worker implements Callable<Long> {
        int i;
        Worker(int i) {
            this.i = i;
        }

        @Override
        public Long call() throws Exception {
            long startTime = System.nanoTime();

            WDataObject ob = new WDataObject(i);
            for (int s = 1; s <= SAMPLE_LIMIT; s++) {
                ob.dataList[s-1] = (double) i / (double)s;
            }
            long endTime = System.nanoTime();

            return (endTime - startTime);
        }

    }

    class WDataObject implements Comparable<WDataObject> {
        private final int id;
        double[] dataList = new double[SAMPLE_LIMIT];

        WDataObject(int id) {
            this.id = id;
        }

        public Integer getID() {
            return id;
        }

        public int getId() {
            return id;
        }

        public String toString() {
            String result = "";
            for (double data : dataList) {
                result += df.format(data) + ",";
            }
            return result.substring(0, result.length() - 1);
        }

        @Override
        public int compareTo(WDataObject o) {
            return getID().compareTo(o.getID());
        }
    }
}

Below is the output after changes.

    Linear              Non-Linear
1   linear = 0.954303       non linear = 1.582391   
2   linear = 0.926418       non linear = 1.581830   
3   linear = 0.600321       non linear = 1.454271   
4   linear = 0.599520       non linear = 1.606025   
5   linear = 0.608767       non linear = 1.529756   
6   linear = 0.592436       non linear = 1.546165   
7   linear = 0.587736       non linear = 1.525757   
8   linear = 0.593176       non linear = 1.599800   
9   linear = 0.586822       non linear = 1.452616   
10  linear = 0.613389       non linear = 1.497857   
Reversed tests
    Non Linear              Linear
1   non linear = 1.654733       linear = 0.591032   
2   non linear = 1.554027       linear = 0.600774   
3   non linear = 1.492715       linear = 0.587769   
4   non linear = 1.574326       linear = 0.603979   
5   non linear = 1.536751       linear = 0.590862   
6   non linear = 1.628588       linear = 0.585333   
7   non linear = 1.591440       linear = 0.604465   
8   non linear = 1.444600       linear = 0.587350   
9   non linear = 1.562186       linear = 0.607937   
10  non linear = 1.559000       linear = 0.586294   
Done test

Now non linear part is working about 3 times faster than linear one.

Thanks for confirming. Wouldn't you expect that the multi-threaded parts of the program to out preform the linear single loop part. The reason I sorted the objects in the test was because what good would it be to use a non-linear algorithm if the resulting data was random... so I considered sorting the resulting data as part of the overhead of getting the job done with a that algorithm vs the linear one. — Siraj, Jan 03 '16 at 01:44
Yes I would expect the non linear algorithm to work faster, I'm still trying to find the reason for this behavior. — 11thdimension, Jan 03 '16 at 01:52
I also rewrote the example to include the correction and, indeed the time was dramatically improved, but the results are still unexpected. In your revised example look at the completion times. I took the average of the values. For the forward test, the average single loop time is 0.6662888 seconds while the non-linear loop is 1.5376468. This is inversely true for your revers test. The reverse loop average is 0.5945795 and the reversed non-linear average is 1.5598366 seconds. Also, I still notices that the linear loop is causing all of my workstations cores work at max load. — Siraj, Jan 03 '16 at 14:39
Why would max load occur on all cores when a single nested loop is running on a single thread? — Siraj, Jan 03 '16 at 14:41
See this answer by @Peter Lawrey http://stackoverflow.com/questions/8485947/why-does-a-single-threaded-process-execute-on-several-processors-cores — 11thdimension, Jan 03 '16 at 17:58
Nice find! That explains the CPU workload distribution. So I am assuming, forgive me if I am wrong, that the reason that multithreaded non-linear algorithm has a longer run time is because switching multiple threads across cores is less efficient than moving one? — Siraj, Jan 03 '16 at 18:17
This also raises the flag: what is the value of using multithreaded complexities when the OS cpu scheduler seems to do a better job of balancing a work load than the language and compiler it's self? — Siraj, Jan 03 '16 at 18:20
OS will not be balancing it, all execution will still be sequential, at a single moment only one CPU core will be in use. — 11thdimension, Jan 03 '16 at 19:10
Granted, but what about the results. The fact is a single loop is out preforming a multi threaded worker group, now matter how you look at it. Which to me kind of goes against everything I learned about the virtues of taking the time to learn multi threaded programming... the fact that we have reproducible results with this scenario means that we should consider at what point is the use of multi threaded algorithms appropriate and inappropriate. So I guess this discussion has raised that question. Looks like it is more effective to do statistics with embedded loops, or at least not with a pool — Siraj, Jan 03 '16 at 19:16

Java Thread Worker Group performance vs Embedded Looping performance

And here goes a sample of the output of running this program...

2 Answers2