3

Given boolean matrix M, I need to find a set of submatrices A = {A1, ..., An} such that matrices in A contain all True values in matrix M and only them. Submatrices don't have to be continuous, i.e. each submatrix is defined by the two sets of indices {i1, ..., ik}, {j1, ..., jt} of M. (For example submatrix could be something like [{1, 2, 5}, {4, 7, 9, 13}] and it is all cells in intersection of these rows and columns.) Optionally submatrices can intersect if this results in better solution. The total number of submatrices n should be minimal.

Size of the matrix M can be up to 10^4 x 10^4, so I need an effective algorithm. I suppose that this problem may not have an effective exact algorithm, because it reminds me some NP-hard problems. If this is true, then any good and fast approximation is OK. We can also suggest that the amount of true values is not very big, i.e. < 1/10 of all values, but to not have accidental DOS in prod, the solution not using this fact is better.

I don't need any code, just a general idea of the algorithm and justification of its properties, if it's not obvious.

Background

We are calculating some expensive distance matrices for logistic applications. Points in these requests are often intersecting, so we are trying do develop some caching algorithm to not calculate parts of some requests. And to split big requests into smaller ones with only unknown submatrices. Additionally some distances in the matrix may be not needed for the algorithm. On the one hand the small amount of big groups calculates faster, on the other hand if we include a lot of "False" values, and our submatrices are unreasonably big, this can slow down the calculation. The exact criterion is intricate and the time complexity of "expensive" matrix requests is hard to estimate. As far as I know for square matrices it is something like C*n^2.5 with quite big C. So it's hard to formulate a good optimization criterion, but any ideas are welcome.

About data

True value in matrix means that the distance between these two points have never been calculated before. Most of the requests (but not all) are square matrices with the same points on both axes. So most of the M is expected to be almost symmetric. And also there is a simple case of several completely new points and the other distances are cached. I deal with this cases on preprocessing stage. All the other values can be quite random. If they are too random we can give up cache and calculate the full matrix M. But sometimes there are useful patterns. I think that because of the nature of the data it is expected to contain more big sumbatrices then random data. Mostly True values are occasional, but form submatrix patterns, that we need to find. But we cannot rely on this completely, because if algorithm gets too random matrix it should be able to at least detect it to not have too long and complex calculations.

Update

As stated in wikipedia this problem is called Bipartite Dimension of a graph and is known to be NP-hard. So we can reformulate it info finding fast relaxed approximations for the simple cases of the problem. We can allow some percentage of false values and we can adapt some simple, but mostly effective greedy heuristic.

Dimitrius
  • 564
  • 6
  • 21
  • The submatrices cannot contain any false values? – ravenspoint Oct 31 '22 at 19:31
  • Don't you mean a percentage of false value? – ravenspoint Oct 31 '22 at 19:39
  • Well, actually, to be honest, we can allow some small percentage of false values, but then it's not obvious how to strictly define the optimization criterion. Generally speaking, matrix set A should be minimal by both sizes of submatrices and the number of submatrices. I will add some background in question. – Dimitrius Oct 31 '22 at 19:57
  • 1
    If I understand the problem correctly, you are looking for a block transformed matrix where blocks have only trues or only falses, ie 1s and 0s, ie, binary. If so, [this post](https://mathoverflow.net/questions/191963/transforming-a-binary-matrix-into-triangular-form-using-permutation-matrices) and its trail of references might be of interest, and as you summise , if your problem is equivalent, it is NP hard. However, if equivalent, there will I presume be lots of algorithmic work already done. I have not been able to start to look through the references. – petern0691 Nov 03 '22 at 12:44
  • Since these are distance matrices, is there any interesting/helpful relationship between points? E.g., if (i,j) and (j,k) are both true, do we know anything about (i,k) or (j,i)? Could you paste the top-left 20x20 corner of the matrix? – Dave Nov 04 '22 at 15:56
  • What percent of cells are True? If you know it, what percent of submatrices defined by exactly two indices along each axis have all four values as True? – Dave Nov 04 '22 at 16:15
  • @Dave I answered part of your question in the question itself. The other part: For now I cannot estimate neither percentage nor probability. But it should be small and tend to have submatrices. If we know (i,j) and (j,k) we cannot say anything about the other values. – Dimitrius Nov 04 '22 at 16:54
  • 1
    If you interpret M as an adjacency matrix, this is called the [bipartite dimension](https://en.wikipedia.org/wiki/Bipartite_dimension) of a graph, and is NP-complete (even for bipartite graphs, as in your case). It's also NP-hard to approximate within a factor of `|V|^(1-c)` for any positive c. However, an equivalent problem (binary matrix factorization) has been very popular and important in ML research recently, and may have practical algorithms you can use. Still, your matrix is rather large. – kcsquared Nov 05 '22 at 00:08

3 Answers3

1

I started working on the algorithm below before you provided the update.

Also, in doing so I realised that while one is looking for blocks of true values, the problem is not one of a block transformation, as you have also now updated.

The algorithm is as as follows:

  1. count the trues in each row
  2. for any row with the maximum count of trues, sort the columns in the matrix so that the row's trues all move to the left
  3. sort the matrix rows in descending order of congruent trues on the left (there will now be an upper left rough triangle of congruent trues)
  4. get the biggest rectangle of trues cornered at the upper left
  5. store the row ids and column ids for that rectangle (this is a sub-matrix definition)
  6. change the the sub-matrix's trues to falses
  7. repeat from the top until the upper left triangle has no trues

This algorithm will produce a complete cover of the boolean matrix consisting of row-column intersection sub-matrices containing only true values.

I am not sure if allowing some falses in a sub-matrix will help. While it will allow bigger sub-matrices to be found and hence reduce the number of passes of the boolean matrix to find a cover, it will presumably take longer to find the biggest such sub-matrices because there will be more combinations to check. Also, I am not sure how one might stop falsey sub-matrices from overlapping. It might need the maintenance of a separate mask matrix rather than using the boolean matrix as its own mask, in order to ensure disjoint sub-matrices.

Below is a first cut implementation of the above algorithm in python.

I ran it on Windows 10 on a Intel Pentium N3700 @ 1.60Ghz with 4GB RAM As is, it will do, with randomly generated ~10% trues:

  • 100 rows x 1000 columns < 7 secs
  • 1000 rows x 100 columns < 6 secs
  • 300 rows x 300 columns < 14 secs
  • 3000 rows x 300 columns < 3 mins
  • 300 rows x 3000 columns < 15 mins
  • 1000 rows x 1000 columns < 8 mins

I have not tested it on approximately symmetric matrices, nor have I tested it on matrices with relatively large sub-matrices. It might perform well with relatively large sub-martrices, eg, in the extreme case, ie, the entire boolean matrix is true, only two passes of the algorithm loop are required.

One area I think there can be considerable optimisation is in the row sorting. The implementation below uses the in-built phython sort with a comparator function. A custom crafted sort function will probably do much better, and possibly especially so if it is a virtual sort similar to the column sorting.

If you can try it on some real data, ie, square, approximately symmetric matrix, with relatively large sub-matrices, it would be good to know how it goes.

Please advise if you would like to me to try some optimisation of the python. I presume to handle 10^4 x 10^4 boolean matrices it will need to be a lot faster.

from functools import cmp_to_key

booleanMatrix0 = [
( 0, 0, 0, 0, 1, 1 ),
( 0, 1, 1, 0, 1, 1 ),
( 0, 1, 0, 1, 0, 1 ),
( 1, 1, 1, 0, 0, 0 ),
( 0, 1, 1, 1, 0, 0 ),
( 1, 1, 0, 1, 0, 0 ),
( 0, 0, 0, 0, 0, 0 )
]

booleanMatrix1 = [
( 0, )
]

booleanMatrix2 = [
( 1, )
]

booleanMatrix3 = [
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 )
]

booleanMatrix4 = [
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 )
]

booleanMatrix14 = [
( 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0 ),
( 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 ),
( 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 ),
( 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0 ),
( 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1 ),
( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1 ),
( 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1 ),
( 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 ),
( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1 ),
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ),
( 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 )
]

booleanMatrix15 = [
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ),
]

booleanMatrix16 = [
( 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1 ),
]

import random

booleanMatrix17 = [
]
for r in range(11):
    row = []
    for c in range(21):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix17.append(tuple(row))

booleanMatrix18 = [
]
for r in range(21):
    row = []
    for c in range(11):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix18.append(tuple(row))

booleanMatrix5 = [
]
for r in range(50):
    row = []
    for c in range(200):
        row.append(random.randrange(2))
    booleanMatrix5.append(tuple(row))

booleanMatrix6 = [
]
for r in range(200):
    row = []
    for c in range(50):
        row.append(random.randrange(2))
    booleanMatrix6.append(tuple(row))

booleanMatrix7 = [
]
for r in range(100):
    row = []
    for c in range(100):
        row.append(random.randrange(2))
    booleanMatrix7.append(tuple(row))

booleanMatrix8 = [
]
for r in range(100):
    row = []
    for c in range(1000):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix8.append(tuple(row))

booleanMatrix9 = [
]
for r in range(1000):
    row = []
    for c in range(100):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix9.append(tuple(row))

booleanMatrix10 = [
]
for r in range(317):
    row = []
    for c in range(316):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix10.append(tuple(row))

booleanMatrix11 = [
]
for r in range(3162):
    row = []
    for c in range(316):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix11.append(tuple(row))

booleanMatrix12 = [
]
for r in range(316):
    row = []
    for c in range(3162):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)
    booleanMatrix12.append(tuple(row))

booleanMatrix13 = [
]
for r in range(1000):
    row = []
    for c in range(1000):
        if random.randrange(5) == 1:
            row.append(random.randrange(2))
        else:
            row.append(0)

    booleanMatrix13.append(tuple(row))

booleanMatrices = [ booleanMatrix0, booleanMatrix1, booleanMatrix2, booleanMatrix3, booleanMatrix4, booleanMatrix14, booleanMatrix15, booleanMatrix16, booleanMatrix17, booleanMatrix18, booleanMatrix6, booleanMatrix5, booleanMatrix7, booleanMatrix8, booleanMatrix9, booleanMatrix10, booleanMatrix11, booleanMatrix12, booleanMatrix13 ]

def printMatrix(matrix, colOrder):
    for r in range(rows):
        row = ""
        for c in range(cols):
            row += str(matrix[r][0][colOrder[c]])
        print(row)
    print()

def rowUp(matrix):

    rowCount = []
    maxRow = [ 0, 0 ]
    for r in range(rows):
        rowCount.append([ r, sum(matrix[r][0]) ])
        if rowCount[-1][1] > maxRow[1]:
            maxRow = rowCount[-1]
    return rowCount, maxRow

def colSort(matrix):
    #  For a row with the highest number of trues, sort the true columns to the left
    newColOrder = []
    otherCols = []
    for c in range(cols):
        if matrix[maxRow[0]][0][colOrder[c]]:
            newColOrder.append(colOrder[c])
        else:
            otherCols.append(colOrder[c])
    newColOrder += otherCols
    return newColOrder

def sorter(a, b):
    #  Sort rows according to leading trues
    length = len(a)
    c = 0
    while c < length:
        if a[0][colOrder[c]] == 1 and b[0][colOrder[c]] == 0:
                return -1
        if b[0][colOrder[c]] == 1 and a[0][colOrder[c]] == 0:
                return 1
        c += 1
    return 0

def allTrues(rdx, cdx, matrix):
    count = 0
    for r in range(rdx+1):
        for c in range(cdx+1):
            if matrix[r][0][colOrder[c]]:
                count += 1
            else:
                return
    return rdx, cdx, count
            
def getBiggestField(matrix):
    #  Starting at (0, 0) find biggest rectangular field of 1s
    biggestField = (None, None, 0)
    cStop = cols
    for r in range(rows):
        for c in range(cStop):
            rtn = allTrues(r, c, matrix)
            if rtn:
                if rtn[2] > biggestField[2]:
                    biggestField = rtn
            else:
                cStop = c
                break;
        if cStop == 0:
            break
    return biggestField

def mask(matrix):
    maskMatrix = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(matrix[r][0][c])
        maskMatrix.append([ row, matrix[r][1] ])
    maskRows = []
    for r in range(biggestField[0]+1):
        maskRows.append(maskMatrix[r][1])
        for c in range(biggestField[1]+1):
            maskMatrix[r][0][colOrder[c]] = 0
    maskCols= []
    for c in range(biggestField[1]+1):
        maskCols.append(colOrder[c])
    return maskMatrix, maskRows, maskCols


#   Add a row id to each row to keep track of rearranged rows
rowIdedMatrices = []
for matrix in booleanMatrices:
    rowIdedMatrix = []
    for r in range(len(matrix)):
        rowIdedMatrix.append((matrix[r], r))
    rowIdedMatrices.append(rowIdedMatrix)

import time

for matrix in rowIdedMatrices:

    rows = len(matrix)
    cols = len(matrix[0][0])
    colOrder = []
    for c in range(cols):
        colOrder.append(c)
    subMatrices = []

    startTime = time.thread_time()
    loopStart = time.thread_time()
    loop = 1
    rowCount, maxRow = rowUp(matrix)
    ones = 0
    for row in rowCount:
        ones += row[1]
    print( "_________________________\n", "Rows", rows, "Columns", cols, "Ones", str(int(ones * 10000 / rows / cols) / 100) +"%")
    colOrder = colSort(matrix)
    matrix.sort(key=cmp_to_key(sorter))
    biggestField = getBiggestField(matrix)
    if biggestField[2] > 0:
        maskMatrix, maskRows, maskCols = mask(matrix)
        subMatrices.append(( maskRows, maskCols ))

    while biggestField[2] > 0:
        loop += 1
        rowCount, maxRow = rowUp(maskMatrix)
        colOrder = colSort(maskMatrix)
        maskMatrix.sort(key=cmp_to_key(sorter))
        biggestField = getBiggestField(maskMatrix)
        if biggestField[2] > 0:
            maskMatrix, maskRows, maskCols = mask(maskMatrix)
            subMatrices.append(( maskRows, maskCols) )
        if loop % 100 == 0:
            print(loop, time.thread_time() - loopStart)
            loopStart = time.thread_time()

    endTime = time.thread_time()
    print("Sub-matrices:", len(subMatrices), endTime - startTime)
    for sm in subMatrices:
        print(sm)

    print()
    input("Next matrix")
petern0691
  • 868
  • 4
  • 11
  • Thank you! I have implemented something based on your idea in rust, it works fast enough and produces more or less good result. Now I see that after building initial solution we can do some branch and bound or local search (ie simulated annealing) to improve the solution father if we have time. – Dimitrius Nov 09 '22 at 08:45
0
  • LOOP over true values
  • Can you grow the submatrix containing the true value in any direction

( i.e can you go from

 t

to

tt
tt  

)

  • Keep growing for as long as possible
  • Set all cells in M that are in the new submatrix to false
  • Repeat until every cell in M is false.

Here is a simple example of how it works

enter image description here

The top picture shows the large Matrix M containing a few true values

The bottom rows show the first few iteration, with the blus submatric growing as it finds more adjacent cells with true values. In this case I have stopped because it cannot grow any durther without including false cells. If a few cells in a submatrix can be false, then you could continue a bit further.

ravenspoint
  • 19,093
  • 6
  • 57
  • 103
  • 1
    The problem with this approach is that sometimes you can have several directions to grow. For example you may need to choose between growing (3,3) matrix to (4, 4), (3, 5), or (5, 3) with unknown father perspectives. Besides the time complexity is quite bad. Every grow search is at least 2 * size * len(M). If all values are True, this results in len(M)^4 complexity. – Dimitrius Oct 31 '22 at 19:55
  • Grow in any direction and every direction. Keep growing until you cannot grow any more. – ravenspoint Oct 31 '22 at 19:57
  • The grow search is tiny! It is, at worst, 2 * ( width + length ) of the SUBMATRIX ( not of M like you suggest ) So first search is just 4 cells. – ravenspoint Oct 31 '22 at 20:01
  • 1
    It can be 4 cells if we searched for continuous submatrices, but we look for any submatrices, so for every 2 points in one row we should check the whole 2 columns. – Dimitrius Oct 31 '22 at 20:14
  • I think you should re-read my description, carefully. You never have to search an entire column ( or row ) Just the cells adjacent to the submatrix, to see if it is possible to grow any more. – ravenspoint Oct 31 '22 at 21:24
  • I have added a graphical description of how the algorithm operates. Can you see how it only looks at a few cells on each iteration? – ravenspoint Oct 31 '22 at 21:40
  • 1
    Thank you very much for your effort, maybe I was not clear enough. You algorithm works perfectly if submatrix is something like [[1, 1, 0], [1, 1, 0], [0, 0, 0]] idx=(0,1) (0, 1). But I need [[1, 0, 1], [0, 0, 0], [1, 0, 1]] where idx=(0, 2), (0, 2) to also work. Or for a bigger matrix idx could be (0, 3, 5), (4, 9, 12). These submatrices are not continuous, not adjacent. And they are harder to find. But your visualization is useful, it helps me with some ideas. – Dimitrius Oct 31 '22 at 22:25
  • What is a "continuous matrix"? In any case, your question states that : "Submatrices don't have to be continuous" – ravenspoint Nov 01 '22 at 00:02
0

Let's say M is an s by t matrix. The trivial (but possibly useful) solution is just to take all the non-empty columns (or rows) as your submatrices. This will result in at most min(s,t) submatrices.

Dave
  • 7,460
  • 3
  • 26
  • 39