how to calculate the minimum unfairness sum of a list

Question

I have tried to summarize the problem statement like this:

Given n, k and an array (a list) arr, where n = len(arr) and k is an integer with 1 <= k <= n.

For an array (or list) myList, the Unfairness Sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.

To explain: if mylist = [1, 2, 5, 5, 6], then its unfairness sum (written MUS below) is computed as follows. Please note that elements are considered unique by their index in the list, not by their values.

MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
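As a quick check of the definition, the sum above can be computed directly. This is only a minimal sketch; the function name `unfairness_sum` is made up for illustration.

```python
from itertools import combinations

def unfairness_sum(values):
    """Sum of |a - b| over every pair of positions in values."""
    # combinations() pairs elements by position, so the two 5s count separately
    return sum(abs(a - b) for a, b in combinations(values, 2))

mylist = [1, 2, 5, 5, 6]
print(unfairness_sum(mylist))  # 26
```

The ten terms are 1 + 4 + 4 + 5 + 3 + 3 + 4 + 0 + 1 + 1, which add up to 26.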

If you actually need to look at the problem statement, It's HERE

My Objective

Given n, k and arr (as described above), find the Minimum Unfairness Sum among the unfairness sums of all possible sub arrays, with the constraint that each sub array has length k (which should make our lives easier, I believe :) ).

what I have tried

well, there is a lot to be added in here, so I'll try to be as short as I can.

My first approach used itertools.combinations to generate all possible combinations and statistics.variance to measure the spread of each one (yeah, I know I'm a mess).
Before you look at the code below: do you think variance and unfairness sum are perfectly related (I know they are strongly related), i.e. must the sub array with minimum variance also be the sub array with the MUS?

You only have to check the LetMeDoIt(n, k, arr) function. If you need an MCVE, check the second code snippet below.

from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list

    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass

        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1

    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way

    return sum(final)

The above code works fine for n<30 but raises a MemoryError beyond that. In Python chat, Kevin suggested I try a generator, which is memory efficient (it really is), but since a generator still produces the combinations one by one as we iterate over them, the run was estimated to take over 140 hours (:/) for n=50, k=8.

I posted the same as a question on SO HERE (you might want to have a look to understand me properly; it has discussions and an answer by fusion that led me to my second approach, a better one (I should say fusion's approach xD)).
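For reference, a generator-based brute force might look like the sketch below. The names `unfairness_sum` and `brute_force_mus` are made up for illustration, and the sketch minimizes the unfairness sum directly rather than going through the variance. It keeps only one combination in memory at a time, but it still has to visit every single one, which is why it cannot scale. `math.comb` requires Python 3.8 or newer.

```python
from itertools import combinations
from math import comb

def unfairness_sum(values):
    # sum of |a - b| over all pairs of positions
    return sum(abs(a - b) for a, b in combinations(values, 2))

def brute_force_mus(arr, k):
    # combinations() is lazy, so min() consumes one k-subset at a time
    return min(unfairness_sum(sub) for sub in combinations(arr, k))

print(brute_force_mus([1, 2, 5, 5, 6], 3))  # 2, from the subset (5, 5, 6)
print(comb(50, 8))  # 536878650 subsets to visit for n=50, k=8
```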

Second Approach

from itertools import combinations as cmb

def myvar(arr):   # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var<variance:
            variance = var
            min_variance_sub=sub
            
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way

    for r in f:
        final.append(abs(r[0] - r[1]))

    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())

    arr = list(int(input()) for _ in range(n))

    result = LetMeDoIt(n, k, arr)

    print(result)    

if __name__ == '__main__':
    MainApp()

This code works perfectly for n up to 1000 (maybe more), but times out (5 seconds is the limit on the online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).

=====

How would you approach this problem to take care of all the test cases in given time limits (5 sec) ? (problem was listed under algorithm & dynamic programming)

For reference, you can have a look at:

- successful submissions (py3, py2, C++, java) on this problem by other candidates, so that you can explain that approach for me and future visitors
- an editorial by the problem setter explaining how to approach the question
- a solution code by the problem setter himself (py2, C++)
- input data (test cases) and expected output

Edit1 ::

For future visitors of this question, the conclusion I have so far is that variance and unfairness sum are not perfectly related (though they are strongly related). This implies that, among many lists of integers, the list with minimum variance isn't always the list with minimum unfairness sum. If you want to know why, I asked that as a separate question on math stack exchange HERE, where one of the mathematicians proved it for me xD (and it's worth taking a look, because it was unexpected).
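A small counterexample makes the point concrete. The two lists below were constructed for illustration and are not taken from the linked math answer; `unfairness_sum` is a hypothetical helper name. List `b` has the lower variance, yet list `a` has the lower unfairness sum.

```python
from itertools import combinations
from statistics import pvariance

def unfairness_sum(values):
    # sum of |x - y| over all pairs of positions
    return sum(abs(x - y) for x, y in combinations(values, 2))

a = [0, 0, 9]        # population variance 18
b = [100, 105, 110]  # population variance 50/3, about 16.67

print(unfairness_sum(a), unfairness_sum(b))  # 18 20
print(pvariance(b) < pvariance(a))           # True
```

So for arr = [0, 0, 9, 100, 105, 110] and k = 3, picking the window with minimum variance would select [100, 105, 110] and report 20, while the true minimum is 18. The same ordering holds for the sample variance (27 versus 25).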

As far as the question is concerned overall, you can read answers by archer & Attersson below (still trying to figure out a naive approach to carry this out - it shouldn't be far by now though)

Thank you for any help or suggestions :)

I have mixed feelings about this (interesting) question, since this is a hackerrank challenge and asking for help on StackOverflow defeats the purpose of the challenge... — Attersson, Sep 07 '20 at 08:48
"if mylist = [1, 2, 5, 5, 6], then [...] `MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-5|` " are you sure you're not missing `+ |5-6| + |5-6|` at the end here? — Stef, Sep 07 '20 at 08:58
@Stef oh that's a typo :/ Thanks for pointing that out :). Edited. — P S Solanki, Sep 07 '20 at 09:02
Please note that "subarray" is considered [a contiguous section of the array](https://stackoverflow.com/questions/26568560/difference-between-subarray-subset-subsequence). Did you mean "subset" rather? — גלעד ברקן, Sep 07 '20 at 11:21
@גלעדברקן Thank you for the information. After reading the answers on the question you linked to, i would confirm i need `subarrays` (each `subarray` must be of length `k`). — P S Solanki, Sep 07 '20 at 11:31
The editorial you linked to states, "In this problem, we are given a list of N numbers out of which K numbers are to be chosen such that the unfairness sum is minimized." There is no mention that the K numbers need to be contiguous. The K contiguous numbers are chosen in the editorial from the *sorted* array, which means they were not necessarily contiguous in the original array. It's important for you to clarify this in the problem statement. — גלעד ברקן, Sep 07 '20 at 12:02
@גלעדברקן Oh yeah, now that I fully understand what contiguous means, The elements need NOT be contiguous. They can be in any order that makes minimum unfairness sum minimum for those k elements. — P S Solanki, Sep 07 '20 at 12:13
Your link to the original problem statement has died. Please either post a permanent link or copy the text into your question. — MattDMo, Sep 08 '20 at 15:50
@MattDMo I just rechecked all the hyperlinks in the post, and they all seemed to work fine. I've still revised them again. — P S Solanki, Sep 08 '20 at 16:15
@PSSolanki I saw you were in chat earlier. Ping me when you go back and we can discuss without cluttering up the comments here. — MattDMo, Sep 08 '20 at 17:29

IoaTzimas · Answer 1 · 2020-09-07T09:26:21.847

2

You must work on your list SORTED and check only sublists of consecutive elements. This is because any sublist that skips over an element of the sorted list cannot have a lower unfairness sum than the best sublist of consecutive elements.

For example if the list is

[1,3,7,10,20,35,100,250,2000,5000] and you want to check sublists of length 3, then the solution must be one of [1,3,7], [3,7,10], [7,10,20], etc. Any other sublist, e.g. [1,3,10], will have a higher unfairness sum, because 10>7 and therefore all its differences with the rest of the elements will be larger than those of 7. The same goes for [1,7,10] (non-consecutive on the left side), as 1<3.

Given that, you only have to check for consecutive sublists of length k which reduces the execution time significantly

Regarding coding, something like this should work:

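import itertools

def myvar(array):
    # unfairness sum: sum of |a-b| over all pairs of elements
    return sum([abs(i[0]-i[1]) for i in itertools.combinations(array,2)])

def minsum(n, k, arr):
    l = sorted(arr)  # work on the sorted list
    res=1000000000000000000000 #alternatively make it equal with first subarray
    for i in range(n-k+1):  # there are n-k+1 windows of length k
        res=min(res, myvar(l[i:i+k]))
    return res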
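The claim that only consecutive elements of the sorted list need checking can be sanity-checked by brute force on the example list above. This is a minimal sketch for small inputs only; the function names are invented for illustration.

```python
from itertools import combinations

def unfairness_sum(values):
    # sum of |a - b| over all pairs of positions
    return sum(abs(a - b) for a, b in combinations(values, 2))

def best_over_all_subsets(arr, k):
    # exhaustive search over every k-element subset (exponential, small n only)
    return min(unfairness_sum(s) for s in combinations(arr, k))

def best_over_sorted_windows(arr, k):
    # only the n - k + 1 windows of consecutive elements in sorted order
    s = sorted(arr)
    return min(unfairness_sum(s[i:i + k]) for i in range(len(s) - k + 1))

data = [1, 3, 7, 10, 20, 35, 100, 250, 2000, 5000]
print(best_over_all_subsets(data, 3), best_over_sorted_windows(data, 3))  # 12 12
```

Both searches agree on [1,3,7] with an unfairness sum of 12, although the second one looks at 8 windows instead of 120 subsets.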

edited Sep 07 '20 at 09:26

answered Sep 07 '20 at 08:57

IoaTzimas


regarding the code you suggested, defining the variable `res` at the start - is that something related to dynamic programming (i know what is it, i don't know how to do it xD) ? And where do you stand on that `variance vs MUS thing` (their co-relation) ? – P S Solanki Sep 07 '20 at 09:11
1

res is the initial value of the minimum sum. We must set a very large value on it initially, and it will gradually be replaced by lower values inside the loop. When the loop ends, res will be the lowest unfairness sum that we are looking for. If there is a chance that the final res will be very large, instead of 1000000000000 you can set res=myvar(l[0:k]) initially so that it will always return the correct result – IoaTzimas Sep 07 '20 at 09:15
I get that. I will just set a random value actually, at least to save a few precious microseconds ;) (if i'm true on that). – P S Solanki Sep 07 '20 at 09:18
1

I have added a myvar function in my code. Check if the entire code does the work – IoaTzimas Sep 07 '20 at 09:20
It works but terminates due to time out for n<=10000. The exact code I tried was this: [My Code](https://paste.atilla.org/paste/9L5F25)... Please have a look. If you are interested you can have a look at three different successful submissions [HERE](https://paste.atilla.org/paste/0K9PFA), so as to either explain their approach to me & future visitors OR suggest some other approach based on those. Just FYI, the mentioned time complexity for this dynamic programming question should be `n log n` – P S Solanki Sep 07 '20 at 12:03

score 1 · Accepted Answer · answered Sep 07 '20 at 15:59

1

I see this question still has no complete answer. I will outline a correct algorithm that will pass the judge. I will not write the code, in order to respect the purpose of the Hackerrank challenge, since we have working solutions.

1. The original array must be sorted. This has a complexity of O(N log N).
2. At this point you can check only consecutive sub arrays, as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer.
3. The last pass, finding the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating it from scratch at every step, which takes O(k) per step and brings the complexity of this pass to O(k*N). Each step can be done in O(1), as the editorial you posted shows, including the mathematical formulae. It requires initializing a cumulative-sum array after step 1 (done in O(N), with space complexity O(N) too).

It works but terminates due to time out for n<=10000.

(from comments on archer's answer)

To explain step 3, think about k = 100. You are scrolling through the N-long array, and on the first iteration you must calculate the US for the sub array from element 0 to 99 as usual, which takes 100 steps. The next iteration needs the same for a sub array that differs from the previous one by only one element: 1 to 100. Then 2 to 101, etc. If it helps, think of it like a snake: one block is removed and one is added. There is no need to redo the whole O(k) scan. Just work out the maths as explained in the editorial and you will do it in O(1).

So the final complexity will asymptotically be O(NlogN) due to the first sort.
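For readers who have already attempted the challenge, step 3 can be sketched as follows. This is a minimal sketch based on a derivation worked out here, not the editorial's formula or the problem setter's code. With the sorted array `x` and the current window starting at index `i`, dropping `x[i]` and adding `x[i+k]` changes the unfairness sum by `(k-1)*(x[i] + x[i+k]) - 2*S`, where `S` is the sum of the k-1 elements shared by both windows (read off a prefix-sum array in O(1)). The names `min_unfairness_sum`, `prefix` and `inner` are assumptions for illustration.

```python
from itertools import accumulate

def min_unfairness_sum(arr, k):
    """Minimum unfairness sum over all k-element selections of arr."""
    x = sorted(arr)                      # step 1: O(N log N)
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(arr)")
    prefix = [0] + list(accumulate(x))   # prefix[j] == sum(x[:j])

    # First window in O(k): in sorted order, x[j] appears with a plus sign
    # in j pairs and with a minus sign in k-1-j pairs.
    current = sum((2 * j - k + 1) * x[j] for j in range(k))
    best = current

    # Steps 2 and 3: slide the window, O(1) per move.
    for i in range(n - k):
        inner = prefix[i + k] - prefix[i + 1]   # sum of x[i+1 .. i+k-1]
        current += (k - 1) * (x[i] + x[i + k]) - 2 * inner
        best = min(best, current)
    return best

print(min_unfairness_sum([1, 2, 5, 5, 6], 3))  # 2
```

On the sorted list [1, 2, 5, 5, 6] with k = 3 the window sums are 8, 6 and 2, so the result is 2. Python integers do not overflow, so no special care is needed for large values.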

answered Sep 07 '20 at 15:59

Attersson


I respect the fact that you considered the integrity of hackerrank, but I have already cleared this practice problem, half the test cases on my own, rest successful ones from these discussions, and all the test cases with the code from other candidates [THAT DOES NOT ADD UP TO MY SCORE - ALSO IT WAS JUST A PRACTICE PROBLEM PUBLISHED IN 2013]. So I will try with the algo you suggested, and I will let you know if anything goes wrong :) – P S Solanki Sep 07 '20 at 16:09
1

Yes no problem :-) absolutely no downvote or anything. I still had to specify and help within bounds :) (and see the question score, which is pretty good) – Attersson Sep 07 '20 at 16:11
1

Hehehe. You are absolutely correct on your stand :) .. Anyways i actually added a part of my question `variance vs minimum unfairness sum` on `math.stackexchange` [HERE](https://math.stackexchange.com/q/3817341/822711) and to my surprise, I got an unexpected answer. – P S Solanki Sep 07 '20 at 16:18
1

Very interesting and well written! So the idea to use variance unfortunately will not work. But on the other hand by applying the "snake" optimization, there is no problem with calculating the Unfairness Sum – Attersson Sep 07 '20 at 16:24
I still think we may have cleared all the test cases correctly with the variance approach, had it worked in the time limits specified (I dropped the idea though xD). And forgive me for being silly, but what exactly is the 'snake' optimization (I already feel bad after asking this question :-P ) ? – P S Solanki Sep 07 '20 at 17:08
1

I was trying to avoid periphrasing.. I tried to refer to the final O(1*(N-k)) passage. Add one, remove one, as you scroll the sub arrays finding the one with the minimum unfairness sum. In my answer I wrote "think of it like a snake" – Attersson Sep 07 '20 at 17:10
hmm.. I had similar thoughts :) – P S Solanki Sep 07 '20 at 17:12
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221128/discussion-between-p-s-solanki-and-attersson). – P S Solanki Sep 08 '20 at 05:47
1

@Atterson Looks like I somehow created a chat room (SO forced me to do this, I swear ;p ). actually I tried to implement the approach you mentioned, I understood the 'snake' analogy totally, I understood the reason why I was getting a time out error on previous approaches, I understood how I am supposed to continue solving the algorithm (thanks to your explained answer - i now understand the fundamentals of dynamic programming to reduce time complexity ). I am on the verge of figuring out the math explained in editorial, there is just one step in that whole process I can't seem to figure out. – P S Solanki Sep 08 '20 at 05:58
And that problem is [HERE](https://paste.atilla.org/paste/TJWSN0) – P S Solanki Sep 08 '20 at 06:08
1

Check the author's solution where you can find a formula to better "reverse engineer" and understand. https://paste.atilla.org/paste/31PSXK line 31. to note: `lis` is the array, `sum_lis` the cumulative sum of the array. Also note how in lines 25~27 he calculates the unfairness sum for the first sub array. – Attersson Sep 08 '20 at 08:46
