
I've looked everywhere, but apparently I can't seem to find the correct keywords to search for a proper solution, so here goes the problem:


*I have a set of P elements [A, B, ..., Y, Z] and a PxP matrix of values representing the similarity between every pair of elements (so the main diagonal is 100% and every other cell holds a number between 0% and 100%). I want to partition this set into groups of N elements each, so that the solution tends to minimize the average inner similarity of the groups.*


Can you guys give me any insights into how to do this? I have tried looking into standard partitioning algorithms, but most of them don't apply because the weights depend on pairs of elements, not on individual elements.

Thank you!!

  • This sounds NP-complete. It's similar to the [clique cover problem](https://en.wikipedia.org/wiki/Clique_cover), but I don't see an obvious reduction. In any case, you'll probably have to settle for a good but imperfect solution. I'm not sure what sorts of heuristics or approximation algorithms would be appropriate. – user2357112 Jun 23 '16 at 21:40
  • A good but imperfect solution is enough! I have thought of some possible algorithms myself, but I wanted to find out if there are any known approaches for this problem. – Fadi El Didi Jun 23 '16 at 21:46
  • A comment on the last statement ("pairs"). This problem can actually be considered a "standard" partition problem, namely: divide {1,..,P} into N partitions S_1,..,S_N in order to minimize the sum of all superdiagonal elements of the corresponding submatrices. That means you've got a direct goal function which you can optimize (possibly using Dynamic Programming, but I'm not sure). How large are P and N? – davidhigh Jun 23 '16 at 21:59
  • And as you are not strictly looking for the optimum: I like simulated annealing for such problems, as it's conceptually simple. One starts from a random solution (or better, some appropriate heuristic or greedy solution), then defines a set of moves (e.g., put a random particle from partition i to partition j, or exchange two random particles) and then lets it evolve toward the optimum. For benign problems this can work quite well, whereas for hard problems it's worse than brute force. – davidhigh Jun 23 '16 at 22:13
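
A minimal sketch of the simulated-annealing idea described in the last comment, assuming the objective is the total similarity summed over all within-group pairs and that P is divisible by the number of groups (all names and parameter values here are illustrative, not a reference implementation):

import math
import random

def inner_similarity(groups, sim):
    # objective: total similarity over all within-group pairs
    total = 0
    for g in groups:
        for a in range(len(g)):
            for b in range(a + 1, len(g)):
                total += sim[g[a]][g[b]]
    return total

def anneal(sim, n_groups, steps=50000, t0=1.0, cooling=0.9999):
    # random initial partition into equal-size groups
    idx = list(range(len(sim)))
    random.shuffle(idx)
    size = len(sim) // n_groups
    groups = [idx[i * size:(i + 1) * size] for i in range(n_groups)]
    cost = inner_similarity(groups, sim)
    t = t0
    for _ in range(steps):
        # move: swap two random elements between two random groups
        gi, gj = random.sample(range(n_groups), 2)
        a, b = random.randrange(size), random.randrange(size)
        groups[gi][a], groups[gj][b] = groups[gj][b], groups[gi][a]
        new_cost = inner_similarity(groups, sim)  # a real version would compute just the delta
        # always accept improvements; accept worsenings with Boltzmann probability
        if new_cost <= cost or random.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
        else:
            # undo the swap
            groups[gi][a], groups[gj][b] = groups[gj][b], groups[gi][a]
        t *= cooling
    return groups, cost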

2 Answers


Unfortunately this problem is NP-hard, meaning there's unlikely to be a polynomial-time algorithm that solves every instance to optimality. I'll give a reduction from Maximum Bisection. In the decision problem variant of this problem, we're given a graph G and a number k, and asked to partition G's vertices into two equal-size parts such that the number of edges between the two parts is at least k. These slides show that Maximum Bisection is NP-hard by reduction from the more general Maximum Cut problem, where the 2 parts are not required to have the same number of vertices.

Given a graph G = (V, E) and number k, the reduction is:

  • Create a matrix X where X[i][j] = X[j][i] = 1 if (i, j) is an edge in G, and 0 otherwise.
  • Choose N = |V|/2. (This will cause 2 groups to be output.)
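
For concreteness, here is a small sketch of that construction (hedged: the function name and the edge-list input format are illustrative; the diagonal is left at 0 because it never enters the within-group averages):

def bisection_to_partition_instance(n_vertices, edges):
    # similarity matrix for the reduction: X[i][j] = 1 iff (i, j) is an edge of G
    X = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        X[i][j] = X[j][i] = 1
    group_size = n_vertices // 2  # N = |V|/2, so exactly 2 groups come out
    return X, group_size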

Run any exact algorithm for your problem on this constructed input, and let the optimal solution delivered by this algorithm have average similarity y. Then y = (y1+y2)/2, where y1 and y2 are the average similarities of the two groups. Let's call the number of similar unordered pairs in the first group (that is, unordered pairs (i, j) such that X[i][j] = 1) z1. Since the only similarity scores we need to deal with are 1 and 0, y1 is simply z1 divided by the total number of unordered pairs in the first group, which is exactly (|V|/2)(|V|/2-1)/2, so y1 = 2*z1/((|V|/2)(|V|/2-1)). Similarly for y2. Thus, in terms of z1 and z2, y = (z1+z2)/((|V|/2)(|V|/2-1)). Since the denominator is a constant, by minimising the average within-group similarity y, your algorithm also minimises z1+z2: that is, it minimises the total number of within-group similar pairs.
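
A quick brute-force sanity check of that identity on a toy instance (everything here, including the example graph, is illustrative):

from itertools import combinations

def check_identity(n, edges):
    # n must be even so that the two groups have equal size
    X = [[0] * n for _ in range(n)]
    for i, j in edges:
        X[i][j] = X[j][i] = 1
    half = n // 2
    pairs_per_group = half * (half - 1) // 2
    for group1 in combinations(range(n), half):
        group2 = [v for v in range(n) if v not in group1]
        z1 = sum(X[a][b] for a, b in combinations(group1, 2))
        z2 = sum(X[a][b] for a, b in combinations(group2, 2))
        y = (z1 / pairs_per_group + z2 / pairs_per_group) / 2
        assert abs(y - (z1 + z2) / (half * (half - 1))) < 1e-12

check_identity(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)])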

The key thing to notice is that in any solution, every edge of the original graph must either appear within one of the groups or between the two different groups: that is, for any solution Y, nEdgesWithinAGroup(Y) + nEdgesBetweenGroups(Y) = |E|, so minimising the number of within-group edges is the same as maximising the number of between-group edges.

Since by assumption the algorithm for your problem returns a solution with minimum possible y, and we have established above that this also implies a minimum possible value of z1+z2, and furthermore that the latter implies a maximum possible number of between-group edges, it follows that the number of edges between the two groups, |E| - z1 - z2, is maximum-possible. Thus all that remains to solve the original Maximum Bisection problem is to compare this value to the given value of k, returning YES if it is >= k and NO otherwise.
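
Putting the pieces together, the whole decision procedure is just the following sketch, reusing bisection_to_partition_instance from the construction above; solve_partition stands in for the assumed black-box exact algorithm for your problem (it is hypothetical, returning the groups of the minimum-y solution):

from itertools import combinations

def max_bisection_decision(n_vertices, edges, k):
    X, group_size = bisection_to_partition_instance(n_vertices, edges)
    groups = solve_partition(X, group_size)  # hypothetical exact solver
    # z1 + z2: within-group similar pairs of the returned solution
    z = sum(X[a][b] for g in groups for a, b in combinations(g, 2))
    return len(edges) - z >= k  # between-group edges compared with k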

The above implies that, given any polynomial-time algorithm for solving your problem, and any instance of the NP-hard Maximum Bisection problem, we could in polynomial time construct an instance of your problem, solve it, and convert the solution to a solution to the original Maximum Bisection problem -- that is, it implies that we could solve an NP-hard problem in polynomial time. This implies that your problem is itself NP-hard.

j_random_hacker

If I'm not completely misunderstanding your problem, and a poor (brute-force) way of doing it is acceptable, here it is:

  1. Generate all (P choose P/N) combinations, where P is the number of elements and N is the number of groups you want to partition into (so each group has P/N elements).
  2. Calculate the "inner similarity" (the sum of pairwise similarities) for each combination from step 1.
  3. Greedily take the N combinations with the least inner similarity from step 2, skipping any that overlap a group already chosen.

Python implementation:

def getValues(matrix):
    """Flatten the strict upper triangle of the matrix, row by row."""
    values = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            values.append(matrix[i][j])
    return values



def c(arr, curr, end, k, n, comb=None):
    """Recursively collect all 0/1 masks of length `end` with exactly k ones."""
    if comb is None or n == 1:
        # fresh result list for the top-level call (avoids the
        # mutable-default-argument pitfall)
        comb = []
    # use ==/!= for value comparisons; `is` tests object identity and only
    # happens to work for small integers
    if arr.count(1) != k and curr < end:
        tmparr = list(arr)
        tmparr[curr] = 1
        c(tmparr, curr + 1, end, k, n + 1, comb)
        tmparr[curr] = 0
        c(tmparr, curr + 1, end, k, n + 1, comb)
    if arr.count(1) == k:
        comb.append(arr)
    if n == 1:
        return comb


def combos(l, choose):
    """Use this with c() to get all combinations of `choose` elements."""
    # start from the all-ones mask; c() introduces zeros left to right
    arr = [1 for i in l]
    return c(arr, 0, len(l), choose, 1)


def getComb(combos, elem):
    """
    Turn each 0/1 mask into a string of the selected element names.
    EX. combos=[[0,1,1]], elem=["A","B","C"] returns ["BC"]
    """
    result = []
    for mask in combos:
        tmp = ""
        for j in range(len(mask)):
            if mask[j] == 1:
                tmp += elem[j]
        result.append(tmp)
    return result

def subSum(sub, d):
    """
    Sum the similarities of all unordered pairs within `sub`.
    EX. sub = "abc" returns d["ab"] + d["ac"] + d["bc"]
    sub -- string of element names
    d -- pair -> similarity dictionary
    """
    if len(sub) == 2:
        return d[sub[0] + sub[1]]
    total = 0
    for i in range(len(sub) - 1):
        total += d[sub[0] + sub[i + 1]]
    return total + subSum(sub[1:], d)

def contains(a, b):
    """True if the strings a and b share at least one element name."""
    for i in a:
        if i in b:
            return True
    return False


#**************INPUT HERE**************#
# elements
e = ["A","B", "C", "D", "E", "F"] # partition set into N
N = 2

matrix =[ [100,2,3,4,5,6],
    [ 2, 100,9,16,23 ,30] ,
    [ 44,22,100,11,5 ,2] ,
    [ 11 ,22,33,100, 44, 55],
    [1 ,6,7,13,100, 20 ],
    [1 ,1,2,3,5,100 ] ]
#**************************************#


if len(matrix) == len(e):
    p = getComb(combos(e, len(e) // N), e)  # all candidate groups of size P/N
    q = getComb(combos(e, 2), e)            # all unordered pairs, in the same
                                            # order getValues() emits them
    values = getValues(matrix)

    # build the pair -> similarity lookup used by subSum()
    d = {}
    for i in range(len(q)):
        d[q[i]] = values[i]

    result = []
    for _ in range(N):
        sums = [subSum(group, d) for group in p]
        s = p[sums.index(min(sums))]  # candidate with least inner similarity
        result.append(s)
        # keep only the candidates disjoint from the chosen group
        # (the original removed items from p while iterating over it,
        # which silently skips elements)
        p = [group for group in p if not contains(s, group)]

    print(result)  # this is the answer
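
Be aware this enumerates all (P choose P/N) candidate groups, so it is only feasible for small P. As a side note, the same greedy brute force can be written much more compactly with the standard library (a sketch using itertools.combinations; the function and variable names are just illustrative):

from itertools import combinations

def greedy_partition(elems, sim, n_groups):
    size = len(elems) // n_groups
    index = {e: i for i, e in enumerate(elems)}
    def inner(group):  # sum of pairwise similarities within a group
        return sum(sim[index[a]][index[b]] for a, b in combinations(group, 2))
    remaining = list(combinations(elems, size))
    result = []
    for _ in range(n_groups):
        best = min(remaining, key=inner)  # least inner similarity first
        result.append(best)
        remaining = [g for g in remaining if not set(g) & set(best)]
    return result

print(greedy_partition(e, matrix, N))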
Bobas_Pett