Sorting 2 arrays using 2 threads takes more time than sorting the 2 arrays one by one

Question

I have 2 unsorted arrays and 2 copies of these arrays. I am using two different threads to sort two arrays, then I am sorting other two unsorted array one by one. What I thought was that the thread process would be faster but it's not, so how does threads take more time?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>

struct thread_data
{
    int count;
    unsigned int *arr;
};

struct thread_data thread_data_array[2];

void insertionSort(unsigned int arr[], int n)
{
   int i, key, j;
   for (i = 1; i < n; i++)
   {
       key = arr[i];
       j = i-1;

       while (j >= 0 && arr[j] > key)
       {
           arr[j+1] = arr[j];
           j = j-1;
       }
       arr[j+1] = key;
   }
}

void *sortAndMergeArrays(void *threadarg)
{
    int count;
    unsigned int *arr;
    struct thread_data *my_data;

    my_data = (struct thread_data *) threadarg;
    count = my_data->count;
    arr =  my_data->arr;

    insertionSort(arr, count);

    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    int count, i, rc;
    clock_t start, end, total_t;
    pthread_t threads[2];

    //get the loop count. If loop count is not provided take 10000 as default loop count.
    if(argc == 2){
        count = atoi(argv[1]);
    }
    else{
        count = 10000;
    }

    unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];    

    srand(time(0));

    for(i = 0; i<count; i++){
        arr1[i] = rand();
        arr2[i] = rand();

        copyArr1[i] = arr1[i];
        copyArr2[i] = arr2[i];
    }

    start = clock();
    for(int t=0; t<2; t++) {
        thread_data_array[t].count = count;
        if(t==0)
            thread_data_array[t].arr = arr1;
        else
            thread_data_array[t].arr = arr2;    

        rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *) &thread_data_array[t]);
        if (rc) {
                printf("ERROR; return code from pthread_create() is %d\n", rc);
                exit(-1);
            }
    }

    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
    end = clock();

    total_t = (double)(end - start);
    printf("Total time taken by CPU to sort using threads: %d\n", total_t);


    start = clock();
    insertionSort(copyArr1, count);
    insertionSort(copyArr2, count);
    end = clock();

    total_t = (double)(end - start);
    printf("Total time taken by CPU to sort sequentially: %d\n", total_t);

    pthread_exit(NULL);
}

I am using Linux server to execute the code. First I am randomly populating the arrays and copying them to two separate arrays. For the first two arrays I am creating two threads using pthread and passing the two arrays to them, which uses insertion sort to sort them. And for the other two arrays I am just sorting one by one.

I expected that by using threads I would reduce the execution time but actually takes more time.

Depends on how you measure, on the system, and so on. Impossible to answer without a specific example. — Lundin, Oct 06 '17 at 06:28
It's Dell R710 - Dual Six Core Intel Xeon@3.4GHz Linux Machine, I am using my college server to run this code. — Prabhat Pal, Oct 06 '17 at 06:28
Welcome to Stack Overflow. Please read the [About] and [Ask] pages soon. There is nowhere near enough information in the question to be able to give an answer. We'd need to see the sorting code, and how you're running the threads, etc. All in all, we need an MCVE ([MCVE]). How big are the data sets that you're sorting? — Jonathan Leffler, Oct 06 '17 at 06:28
@JonathanLeffler I am using 15000 random numbers to sort using insertion sort. I have also edited the question and the code and more details. — Prabhat Pal, Oct 06 '17 at 06:39
Ok so your benchmarking code includes the thread creation itself (as it probably should). Creating threads takes a lot of resources. It could be that there is too little data to motivate the use of threads. You can try to increase the amount of data and see if the thread version remains the most ineffective one. — Lundin, Oct 06 '17 at 06:43
How much time is it taking? What are the results for threaded and non-threaded sorting? How much difference are you seeing? Are you being allocated two cores or are you constrained to one by the o/s and the resources granted to you? What happens on your own machine? It’s going to be multicore, so what results do you get there? — Jonathan Leffler, Oct 06 '17 at 07:04
With 300000 numbers, the CPU time taken for threaded sort is 324520000 and for non-threaded sort is 321690000. — Prabhat Pal, Oct 06 '17 at 07:10
My educated guess is that the bottleneck of what you're doing is memory and throwing more CPUs at the problem won't help at all and might actually make things slower (because now you have two CPUs fighting for limited memory bandwidth). Sorting is not a CPU problem, it's a memory problem. — Art, Oct 06 '17 at 07:43
I tested a multi-threaded bottom up merge sort on Windows, with a 4 core cpu (Intel 3770K 3.5ghz). 4 thread run was about 3 times as fast as 1 thread. Link to [test code and results](https://codereview.stackexchange.com/questions/148025/multithreaded-bottom-up-merge-sort) . — rcgldr, Oct 06 '17 at 08:36

Jonathan Leffler · Accepted Answer · 2017-10-07T04:22:30.117

Diagnosis

The reason you get practically the same time — and slightly more time from the threaded code than from the sequential code — is that clock() measures CPU time, and the two ways of sorting take almost the same amount of CPU time because they're doing the same job (and the threading number is probably slightly bigger because of the time to setup and tear down threads).

The clock() function shall return the implementation's best approximation to the processor time used by the process since the beginning of an implementation-defined era related only to the process invocation.

BSD (macOS) man page:

The clock() function determines the amount of processor time used since the invocation of the calling process, measured in CLOCKS_PER_SECs of a second.

The amount of CPU time it takes to sort the two arrays is basically the same; the difference is the overhead of threading (more or less).

Revised code

I have a set of functions that can use clock_gettime() instead (code in timer.c and timer.h at GitHub). These measure wall clock time — elapsed time, not CPU time.

Here's a mildly tweaked version of your code — the substantive changes were changing the type of key in the sort function from int to unsigned int to match the data in the array, and to fix the conversion specification of %d to %ld to match the type identified by GCC as clock_t. I mildly tweaked the argument handling, and the timing messages so that they're consistent in length, and added the elapsed time measurement code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include "timer.h"

struct thread_data
{
    int count;
    unsigned int *arr;
};

struct thread_data thread_data_array[2];

static
void insertionSort(unsigned int arr[], int n)
{
    for (int i = 1; i < n; i++)
    {
        unsigned int key = arr[i];
        int j = i - 1;

        while (j >= 0 && arr[j] > key)
        {
            arr[j + 1] = arr[j];
            j = j - 1;
        }
        arr[j + 1] = key;
    }
}

static
void *sortAndMergeArrays(void *threadarg)
{
    int count;
    unsigned int *arr;
    struct thread_data *my_data;

    my_data = (struct thread_data *)threadarg;
    count = my_data->count;
    arr =  my_data->arr;

    insertionSort(arr, count);

    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    int count = 10000;
    int i, rc;
    clock_t start, end, total_t;
    pthread_t threads[2];

    // get the loop count. If loop count is not provided take 10000 as default loop count.
    if (argc == 2)
        count = atoi(argv[1]);

    unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];

    srand(time(0));

    for (i = 0; i < count; i++)
    {
        arr1[i] = rand();
        arr2[i] = rand();

        copyArr1[i] = arr1[i];
        copyArr2[i] = arr2[i];
    }

    Clock clk;
    clk_init(&clk);

    start = clock();
    clk_start(&clk);
    for (int t = 0; t < 2; t++)
    {
        thread_data_array[t].count = count;
        if (t == 0)
            thread_data_array[t].arr = arr1;
        else
            thread_data_array[t].arr = arr2;

        rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *)&thread_data_array[t]);
        if (rc)
        {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
    clk_stop(&clk);
    end = clock();

    char buffer[32];
    printf("Elapsed using threads:  %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));

    total_t = (double)(end - start);
    printf("CPU time using threads: %ld\n", total_t);

    start = clock();
    clk_start(&clk);
    insertionSort(copyArr1, count);
    insertionSort(copyArr2, count);
    clk_stop(&clk);
    end = clock();

    printf("Elapsed sequentially:   %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    total_t = (double)(end - start);
    printf("CPU time sequentially:  %ld\n", total_t);

    return 0;
}

Results

Example runs (program inssortthread23) — run on a MacBook Pro (15" 2016) with 16 GiB RAM and 2.7 GHz Intel Core i7 CPU, running macOS High Sierra 10.13, using GCC 7.2.0 for compilation. I had routine background programs running — e.g. browser not being actively used, no music or videos playing, no downloads in progress etc. (These things matter for benchmarking.)

$ inssortthread23 100000
Elapsed using threads:  1.060299 s
CPU time using threads: 2099441
Elapsed sequentially:   2.146059 s
CPU time sequentially:  2138465
$ inssortthread23 200000
Elapsed using threads:  4.332935 s
CPU time using threads: 8616953
Elapsed sequentially:   8.496348 s
CPU time sequentially:  8469327
$ inssortthread23 300000
Elapsed using threads:  9.984021 s
CPU time using threads: 19880539
Elapsed sequentially:   20.000900 s
CPU time sequentially:  19959341
$

Conclusions

Here, you can clearly see that:

The elapsed time is approximately twice as long for the non-threaded code as for the threaded code.
The CPU time for the threaded and non-threaded code is almost the same.
The overall time is quadratic in the number of rows sorted.

All of which are very much in line with expectations — once you realize that clock() is measuring CPU time, not elapsed time.

Minor puzzle

You can also see that I'm getting the threaded CPU time as slightly smaller than the CPU time for sequential sorts, some of the time. I don't have an explanation for that — I deem it 'lost in the noise', though the effect is persistent:

$ inssortthread23 100000
Elapsed using threads:  1.051229 s
CPU time using threads: 2081847
Elapsed sequentially:   2.138538 s
CPU time sequentially:  2132083
$ inssortthread23 100000
Elapsed using threads:  1.053656 s
CPU time using threads: 2089886
Elapsed sequentially:   2.128908 s
CPU time sequentially:  2122983
$ inssortthread23 100000
Elapsed using threads:  1.058283 s
CPU time using threads: 2093644
Elapsed sequentially:   2.126402 s
CPU time sequentially:  2120625
$

$ inssortthread23 200000
Elapsed using threads:  4.259660 s
CPU time using threads: 8479978
Elapsed sequentially:   8.872929 s
CPU time sequentially:  8843207
$ inssortthread23 200000
Elapsed using threads:  4.463954 s
CPU time using threads: 8883267
Elapsed sequentially:   8.603401 s
CPU time sequentially:  8580240
$ inssortthread23 200000
Elapsed using threads:  4.227154 s
CPU time using threads: 8411582
Elapsed sequentially:   8.816412 s
CPU time sequentially:  8797965
$