Multiplying every element of one array by every element of another array and sorting the new very large array

Question

Disclaimer: This is an exercise from my course, not from an ongoing contest.

Problem description

The problem description is very straightforward:

You are given two arrays, A and B, containing n and m elements, respectively. The numbers you need to sort are Ai*Bj, for 1 <= i <= n and 1 <= j <= m. In other words, every element of the first array should be multiplied by every element of the second array.

Let C be the result of this sorting, a non-decreasing sequence of elements. Print the sum of every tenth element of this sequence, that is, C1 + C11 + C21 + ... .

1 <= n,m <= 6000

1 <= Ai,Bj <= 40000

Memory limit: 512MB

Time limit: 2 seconds

My solution so far

First I used Java with Arrays.sort. Given the largest n and m, we need to sort an array of 36,000,000 elements, then go through every tenth element of the array to get the sum. This passes 23 test cases; the rest get TLE.

Then I switched to C++, also using the built-in sort, and the result is only a little better: it passes 29 test cases.

My observation

Given this input

4 4
7 1 4 9
2 7 8 11

If we sort both arrays A and B first and then multiply them together, we get

2 8 14 18 7 28 49 63 8 32 56 72 11 44 77 99

which is an array with m sorted subarrays. But I couldn't think of any good way to merge all of these sorted subarrays in O(mn) or somewhere around that. Or do we need to look at the problem from a different angle: are there any special properties involved in multiplying every element of two arrays together?
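The layout described above can be reproduced with a short sketch. The function name `productRows` is hypothetical; it assumes all values are positive (as the constraints state), so each run is ascending once A is sorted.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: sort both inputs, then emit the products run by run.
// Run j holds b[j] * a[0], b[j] * a[1], ...; it is ascending because a is
// sorted and every value is positive.
std::vector<std::uint32_t> productRows(std::vector<std::uint32_t> a,
                                       std::vector<std::uint32_t> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<std::uint32_t> rows;
    rows.reserve(a.size() * b.size());
    for (std::uint32_t y : b)
        for (std::uint32_t x : a)
            rows.push_back(x * y);   // at most 40000 * 40000, fits in 32 bits
    return rows;
}
```

For the sample input this yields the 16 values listed above. Fully sorted, they give C1 + C11 = 2 + 49 = 51.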

Update 1: - using a MinHeap - not fast enough. [TLE]

Update 2: - using a k-way merge - still not fast enough. [TLE]

Update 3: - I forgot to mention the range of the elements in A and B, so I've just added it.

Update 4: - Radix sort base 256 [Accepted]

Conclusion

Through this problem, I learned more about sorting in general, and some useful details about the library sorts in Java and C++.

Built-in sorts in C++ such as std::sort are not stable. std::sort is basically a quicksort, but when the data is unfavorable for quicksort and the recursion gets too deep, it switches to heap sort (see the comment from rcgldr about the Visual Studio implementation). In general it is the fastest built-in sort in C++ (compared with qsort and stable_sort).
For Java, there are three kinds of sort: Arrays.sort(primitive[]), which uses a dual-pivot quicksort under the hood; Arrays.sort(Object[]), which uses Timsort; and Collections.sort, which basically calls Arrays.sort to do the heavy lifting.

Big thanks to @rcgldr for his radix sort base 256 C++ code. It works like a champ with the worst case of 6000*6000 elements; the maximum running time is 1.187s.

Interestingly, C++ std::sort failed only on the three largest test cases; it works fine with input of size 6000*3000.

`couldn't think of any good solution to merge all of these sorted subarrays in O(n)` What has been the best crossing your mind? (I guess I'm asking *would O(mn log(min(m, n)) do?*) — greybeard, Apr 28 '19 at 07:33
(Elements not being specified to be non-negative doesn't exactly help.) — greybeard, Apr 28 '19 at 08:18
@greybeard they're all nonnegative, you have a solution for this? if yes can you give me a little hint? because just normal sorting is not going to work, no matter how good the sorting step is. — Loc Truong, Apr 28 '19 at 08:22
(I thought mentioning *O(mn log(min(m, n)))* __is__ a hint; should work for mixed sign, too.) — greybeard, Apr 28 '19 at 08:47
do you mean that loop through the array C, and then do some sort of look up with binary search in A or B, so we can know the order of that element when C is sorted? — Loc Truong, Apr 28 '19 at 08:53
See https://stackoverflow.com/questions/4279524/how-to-sort-a-m-x-n-matrix-which-has-all-its-m-rows-sorted-and-n-columns-sorted. — dfrib, Apr 28 '19 at 08:54
given m,n are both 6000, it would be 36000000*log(6000) calculation operations, is it possible to run in 2 seconds (time limit)? — Loc Truong, Apr 28 '19 at 09:01
@greybeard do you have other solutions other than working on the sorted-row matrix? — Loc Truong, Apr 28 '19 at 11:04
(In all fairness, with very disparate *m* and *n*, the complexity of the MinHeap approach I have in mind (close enough to [mksteve's *faster*](https://stackoverflow.com/a/55889953/3789665)) is *O(mn log(min(m, n)) + max(m, n)log(max(m, n)))*: product handling should be dominated by sorting the larger Array when size ratio is large enough.) — greybeard, Apr 28 '19 at 21:01
(What is the difference between 1) MinHeap and 2) k (min(m, n)?) way merge?) — greybeard, Apr 28 '19 at 21:02
basically they are roughly the same, but the MinHeap is more flexible because it can be applied for arrays with different size, but as I've just updated the information about memory and time limit, it seems like neither of these methods will be fast enough. — Loc Truong, Apr 29 '19 at 04:18
Re. *n*th amendment to problem spec: Way back when they trained me to train a machine gun on moving targets. (Not really hand-held.) — greybeard, Apr 29 '19 at 06:20
For Visual Studio, std::sort() switches to heap sort (not merge sort) if the level of nesting gets too deep. It also avoids excessive stack usage by only using recursion on the smaller part of a partition, then looping back to handle the larger part of a partition. — rcgldr, Apr 29 '19 at 07:19
@greybeard - a 2 way merge sort is faster than a k way merge sort that uses a minheap or priority queue. If there are enough registers, a 4 way merge sort is possible without using heap, and about 15% faster than 2 way merge sort, but involves nested if's and fall back to 3-way, 2-way merges and copy. Example of C++ 4 way merge at the end of [this answer](https://stackoverflow.com/questions/34844613/optimized-merge-sort-faster-than-quicksort/34845789#34845789) . — rcgldr, Apr 29 '19 at 07:25
@greybeard - for a 4 way merge sort in java or other languages without goto, bottom up gets too complicated, so top down is used instead as shown at the bottom of [this answer](https://stackoverflow.com/questions/55840901/how-can-i-implement-the-merge-sort-algorithm-with-4-way-partition-without-the-er/55842481#55842481) . — rcgldr, Apr 29 '19 at 07:27
I suppose you meant _'The numbers which you need to sort are **Ai*Bj**'_, not 'Ai*Bi'. — CiaPan, Apr 29 '19 at 07:50

mksteve · Answer 1 · 2019-04-28T15:42:28.653

The clue to your answer lies in your observation...

If we sort two array A and B first then multiply them together, we got 2 8 14 18 7 28 49 63 8 32 56 72 11 44 77 99 which is an array with m sorted subarrays.

So there are n sequences of data which are sorted, and the problem is using these to generate the answer.

Hint 1: Can you solve this using a priority queue? The number of elements in the queue would be the same as the number of sorted lists that are generated.

With

#include <vector>
#include <algorithm>
#include <random>
#include <queue>

Given the following structures (C++)

// helper to catch every tenth element.
struct Counter {
    int mCount;
    double mSum;
    Counter() : mCount(0), mSum(0) {}
    void push_back(int val)
    {
        if (mCount++ % 10 == 0)
        {
            mSum += val;
        }
    }
    double sum() { return mSum; }
};

// Storage in the priority queue for each of the sorted results.
struct Generator {
    int i_lhs;      // row: index into the first array
    int i_rhs;      // column: index into the second array
    int product;    // lhs[i_lhs] * rhs[i_rhs]
    Generator() : i_lhs(0), i_rhs(0), product(0) {}
    Generator(size_t lhs, size_t rhs, int p) : i_lhs(lhs), i_rhs(rhs), product(p)
    {
    }
};

// Comparator that makes std::priority_queue return the lowest product first.
struct MinHeap
{
    bool operator()(const Generator & lhs, const Generator & rhs) const
    {
        return lhs.product > rhs.product;
    }
};

I measured ....

double Faster(std::vector<int> lhs, std::vector<int>  rhs)
{
    Counter result;
    if (lhs.size() == 0 || rhs.size() == 0) return 0;

    std::sort(lhs.begin(), lhs.end());
    std::sort(rhs.begin(), rhs.end());
    if (lhs.size() < rhs.size()) {
        std::swap(lhs, rhs);
    }
    size_t lhs_size = lhs.size();
    size_t rhs_size = rhs.size();
    std::priority_queue<Generator, std::vector< Generator >, MinHeap > queue;
    // Seed the queue with the first (smallest) product of every row.
    for (size_t i = 0; i < lhs_size; i++) {
        queue.push(Generator(i, 0, lhs[i] * rhs[0]));
    }
    Generator curr;
    while (!queue.empty()) {
        curr = queue.top();
        queue.pop();
        result.push_back(curr.product);
        curr.i_rhs++;
        // Re-insert the row with its next product, if it has one left.
        if (static_cast<size_t>(curr.i_rhs) < rhs_size) {
            queue.push(Generator(curr.i_lhs, curr.i_rhs, lhs[curr.i_lhs] * rhs[curr.i_rhs]));
        }
    }
    return result.sum();
}

to be faster than the following naive implementation

double Naive(std::vector<int> lhs, std::vector<int>  rhs)
{
    std::vector<int> result;
    result.reserve(lhs.size() * rhs.size());
    for (size_t i = 0; i < lhs.size(); i++) {
        for (size_t j = 0; j < rhs.size(); j++) {
            result.push_back(lhs[i] * rhs[j]);
        }
    }
    std::sort(result.begin(), result.end());
    Counter aCount;
    for (size_t i = 0; i < result.size(); i++) {
        aCount.push_back(result[i]);
    }
    return aCount.sum();
}

Sorting the input vectors is much faster than sorting the output vector. For each row we create a generator, which will iterate over all the columns. The current product is added to the queue as the priority value, and once all generators have been created, we read them out of the queue.

Each time a generator is read out, if it has another column left we advance it and add it back to the queue. This follows from the observation that the output of pre-sorted inputs consists of m subarrays of size n. The queue holds the current minimum of each of the m subarrays, and the smallest of that set is the smallest remaining value of the whole list. Removing a Generator and re-adding it ensures that the top value is always the next smallest item of the result.

The loop still runs nm times. Reading the smallest value is O(1), and each removal or insertion costs O(log n), where n is the number of generators in the queue. We do this once for each product, so the total is O(nm log n + nm), which simplifies to O(nm log n).

The Naive solution is O(nm log nm).
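A rough comparison count for the largest input, n = m = 6000, helps explain why the gain is modest. This is a back-of-the-envelope estimate that ignores constant factors, not a measurement:

```latex
\begin{aligned}
nm \log_2 n    &\approx 3.6 \times 10^{7} \times 12.55 \approx 4.5 \times 10^{8} \\
nm \log_2 (nm) &\approx 3.6 \times 10^{7} \times 25.10 \approx 9.0 \times 10^{8}
\end{aligned}
```

With n = m, log(nm) = 2 log n, so the merge saves only about a factor of two in comparisons, which the per-element cost of the heap may well cancel.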

The performance bottleneck I found in the above solution was the cost of inserting into the queue. I had a speed-up for that, but I don't think it is "algorithmically" much faster.
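One way to reduce the insertion cost, hinted at in the comments on the question, is to keep one heap entry per element of the shorter array, so the heap depth is log(min(m, n)). The following is a minimal sketch under that assumption, not the measured code from this answer; `RunHead`, `RunHeadGreater` and `sumEveryTenthSmallHeap` are hypothetical names, and the values are assumed to be positive.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// One heap entry per run: the current product plus the position that made it.
struct RunHead {
    std::uint32_t product;
    std::uint32_t run;   // index into the shorter array
    std::uint32_t pos;   // index into the longer, sorted array
};

struct RunHeadGreater {
    bool operator()(const RunHead& x, const RunHead& y) const
    {
        return x.product > y.product;   // makes std::priority_queue a min-heap
    }
};

// Sum of the 1st, 11th, 21st, ... smallest products.
std::uint64_t sumEveryTenthSmallHeap(std::vector<std::uint32_t> a,
                                     std::vector<std::uint32_t> b)
{
    if (a.empty() || b.empty()) return 0;
    if (a.size() > b.size()) std::swap(a, b);   // a is now the shorter array
    std::sort(b.begin(), b.end());              // every run a[i] * b[...] ascends

    std::priority_queue<RunHead, std::vector<RunHead>, RunHeadGreater> heap;
    for (std::uint32_t i = 0; i < a.size(); i++)
        heap.push(RunHead{a[i] * b[0], i, 0});

    std::uint64_t sum = 0;
    std::size_t rank = 0;                       // 0-based rank in sorted order
    while (!heap.empty()) {
        RunHead head = heap.top();
        heap.pop();
        if (rank++ % 10 == 0) sum += head.product;
        if (++head.pos < b.size()) {            // advance this run, if possible
            head.product = a[head.run] * b[head.pos];
            heap.push(head);
        }
    }
    return sum;
}
```

Only the longer array needs sorting here, because the heap orders the runs itself. When m and n are equal this changes nothing asymptotically.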

Thanks, I've just googled about it, how to sort "k sorted arrays". There are several methods but I'm also thinking of a new way, but I don't know if it's possible. Anyway I will try the MinHeap approach first and see if it works. — Loc Truong, Apr 28 '19 at 12:09
It's not fast enough :( not much different from sorting the entire mxn array. — Loc Truong, Apr 28 '19 at 13:03
I've updated, after trying MinHeap and K ways merge for merging K sorted arrays, it's not much different than the naive solution using builtin sort method in C++ or Java. — Loc Truong, Apr 28 '19 at 16:38

rcgldr · Accepted Answer · 2019-04-29T07:19:42.977


merge all of these sorted subarray in O(mn)

The products are < 2^31, so 32-bit integers are enough and a base-256 radix sort will work. The sum of every 10th item could need 64 bits.
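Both bounds can be checked at compile time from the limits in the problem statement. This is only a sketch of the arithmetic; the constant names are hypothetical.

```cpp
#include <cstdint>

// Limits taken from the problem statement.
constexpr std::uint64_t kMaxValue = 40000;   // largest Ai or Bj
constexpr std::uint64_t kMaxCount = 6000;    // largest n or m

constexpr std::uint64_t kMaxProduct = kMaxValue * kMaxValue;          // 1,600,000,000
constexpr std::uint64_t kTerms = (kMaxCount * kMaxCount + 9) / 10;    // C1, C11, ...: 3,600,000 terms
constexpr std::uint64_t kMaxSum = kMaxProduct * kTerms;               // 5.76e15

static_assert(kMaxProduct < (1ULL << 31), "each product fits in 32 bits");
static_assert(kMaxSum > UINT32_MAX, "the sum can overflow 32 bits");
static_assert(kMaxSum < (1ULL << 63), "the sum fits in 64 bits");
```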

Update - you didn't mention a memory limit of 256MB in your comments, I just noticed this. The product array size is 6000*6000*4 = 137.33MB. Allocate a working array half the size of the original array (rounded up: work_size = (1+original_size)/2 ), worst case, 3000*6000 elements (< 210MB total space needed). Treat the original (product) array as two halves and use radix sort to sort each half. Move the lower sorted half into the working array, then merge the working array with the upper half of the original array back into the original array. On my system (Intel 3770K 3.5 GHz, Win 7 Pro 64 bit), the 2 radix sorts will take less than 0.4 seconds (~0.185 seconds each) and the one-time merge of 3000*6000 integers will take about 0.16 seconds, less than 0.6 seconds for the sort part. With this approach, there's no need to sort A or B before doing the multiply.

Are you allowed to use SIMD / xmm registers to do the outer product multiply of A and B (A o.x B) ?

Example C++ code for base 256 radix sort:

#include <cstddef>      // size_t
#include <cstdint>      // uint32_t
#include <utility>      // std::swap

//  a is input array, b is working array (same size as a)
//  returns a pointer to the sorted data (a, since there are 4 passes)
uint32_t * RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
    size_t mIndex[4][256] = {0};        // count / index matrix
    size_t i,j,m,n;
    uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 4; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    for(j = 0; j < 4; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
    return(a);
}
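For context, here is a minimal sketch of a whole solution built around a base-256 radix sort with a full-size working array (about 288MB in the worst case, which fits the 512MB limit). It is not the code above: `radixSort256` and `sumEveryTenth` are hypothetical names, and this variant builds one histogram per pass to stay short.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Compact LSD radix sort, base 256, four passes over 32-bit keys.
void radixSort256(std::vector<std::uint32_t>& a)
{
    std::vector<std::uint32_t> b(a.size());       // full-size working array
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t index[257] = {0};
        for (std::uint32_t u : a)
            index[((u >> shift) & 0xff) + 1]++;   // histogram, offset by one
        for (int d = 0; d < 256; d++)
            index[d + 1] += index[d];             // running totals = start offsets
        for (std::uint32_t u : a)
            b[index[(u >> shift) & 0xff]++] = u;  // stable scatter by this byte
        a.swap(b);                                // a again holds the latest pass
    }
}

// Builds all products, sorts them, and sums C1 + C11 + C21 + ...
std::uint64_t sumEveryTenth(const std::vector<std::uint32_t>& A,
                            const std::vector<std::uint32_t>& B)
{
    std::vector<std::uint32_t> c;
    c.reserve(A.size() * B.size());
    for (std::uint32_t x : A)
        for (std::uint32_t y : B)
            c.push_back(x * y);                   // < 2^31 for inputs <= 40000
    radixSort256(c);

    std::uint64_t sum = 0;                        // the sum can exceed 32 bits
    for (std::size_t i = 0; i < c.size(); i += 10)
        sum += c[i];
    return sum;
}
```

Neither A nor B needs to be sorted first, since the radix sort does not exploit existing order.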

Merge sort could be used, but it's slower. Assuming m >= n, a conventional 2-way merge sort will take O(mn ⌈log2(n)⌉) to sort n sorted runs, each of size m. On my system, sorting 6000 runs of 6000 integers takes about 1.7 seconds, and I don't know how long the matrix multiply would take.
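A bottom-up 2-way merge over pre-sorted runs can be sketched with std::inplace_merge. This is an illustration of the pass structure only, not the code that was timed above; `mergeRuns` is a hypothetical name, and std::inplace_merge may allocate its own temporary buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumes c consists of consecutive sorted runs of length runLen (the last
// run may be shorter). Each pass merges neighbouring runs pairwise and doubles
// the run length, so about ceil(log2(number of runs)) passes are needed.
void mergeRuns(std::vector<std::uint32_t>& c, std::size_t runLen)
{
    if (runLen == 0) return;
    const std::size_t total = c.size();
    for (std::size_t width = runLen; width < total; width *= 2) {
        for (std::size_t lo = 0; lo + width < total; lo += 2 * width) {
            const std::size_t hi = std::min(lo + 2 * width, total);
            std::inplace_merge(c.begin() + lo, c.begin() + lo + width,
                               c.begin() + hi);
        }
    }
}
```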

Using a heap or other form of priority queue will just add to the overhead. A conventional 2-way merge sort would be faster than a k-way merge sort with a heap.

On a system with 16 registers, 8 of which are used as working and ending indexes or pointers to runs, a 4-way merge sort (without a heap) will probably be a bit faster (around 15%). It performs the same total number of operations, with 1.5 x the number of compares but 0.5 x the number of moves, which is a bit more cache friendly.

edited Apr 29 '19 at 07:19

answered Apr 28 '19 at 16:04


Thanks, I will try radix sort, it seems to be my last hope as I've tried MinHeap and K ways merge. – Loc Truong Apr 28 '19 at 16:29
Sorry I forgot to add the input range for A and B, all elements in A,B are nonnegative and do not exceed 40000. – Loc Truong Apr 28 '19 at 16:37
Let there be d digits in input integers. Radix Sort takes O(d*(n+b)) time where b is the base for representing numbers. So given the worst case, the running time is O(10*(36000000+10)), I don't think it will pass. – Loc Truong Apr 28 '19 at 16:41
I've also just updated information about memory limit and time limit. Given the naive solution using builtin sort in C++ and Java O(mn logmn), it got time out. I don't see it's better in this situation. – Loc Truong Apr 28 '19 at 16:57
do you have an efficient implementation of radix sort base 256 there? it doesn't matter the language, as I've never implemented radix sort in base 256 before so it's useful to see how to implement it efficiently. – Loc Truong Apr 28 '19 at 17:06
Yes I do use primitive type in Java for this problem. For the radix sort base 256 implementation, either C++ or Java is great. – Loc Truong Apr 28 '19 at 17:14
@LocTruong - I updated my answer to deal with the 256MB limit. Just need to use a working buffer 1/2 the size of the array, worst case 3000*6000 integers. This will take less than 210MB of space. Use radix sort on each half, then merge the two halves, should take less than 1 second for this. – rcgldr Apr 28 '19 at 19:40
I see they use C++ compiler of Visual Studio 2017, does your code work well with that compiler? – Loc Truong Apr 29 '19 at 04:21
Also would you mind adding the code for the whole program, I still don't know how to use your code. Maybe with a little chart comparison between your code and builtin sort in C++ would be meaningful. – Loc Truong Apr 29 '19 at 04:38
sorry, the memory limit is 512MB, not 256MB, my bad @@ It will be great if you have source code in Java, thank you in advance – Loc Truong Apr 29 '19 at 06:10
Thanks bro, your code works, I just made the working array as big as the input array, because we have 512MB. I will add this code to my library if you allow, so that I can use it later on other occasions like this :D – Loc Truong Apr 29 '19 at 06:33
@LocTruong - I updated my answer to show code to call and check the sort. You can copy, use, and/or modify the code for use now and later. – rcgldr Apr 29 '19 at 07:14
