
I have a list of lists containing the start position of each column in an OCR'd table.

[[16, 102, 119, 136],
 [16, 48, 76, 109, 145],
 [16, 47, 75, 108, 128, 145],
 [16, 48, 77, 110, 141],
 [98, 135]]

My initial idea is to use the longest list as a reference and align the others to it by similarity. Conceptually it is like a fuzzy join, except that each value must match exactly one reference column (at least one match and at most one match).

How can I get from the irregular input list to this expected output?

[[16, '', '', 102, 119, 136],
 [16, 48, 76, 109,  '', 145],
 [16, 47, 75, 108, 128, 145],
 [16, 48, 77, 110,  '', 141],
 ['', '', '',  98,  '', 135]]

The overall goal is to load the text below into a dataframe; I am providing it in case someone proposes a different approach. As you can see, it has missing headers and missing cells, so I came up with the idea above in order to split each line at the common column positions and later write the result to a CSV.

                Cuentas a  la  banca                                                                  INTERES          DIVISA           EUR 
                CUENTA CORRIENTE EMPRESAS      0000  0000  000000000000    EUR                              0,00 %                              0.00 
                CUTRECUENTA EMPRESAS           0000  0000  000000000000    USD                              0.00 %              00.00            00.00 
                CUENTA CORRIENTE EMPRESAS       0000  0000  000000000000     EUR                              0.00%                          00 000.00 
                                                                                                  TOTAL                                00 000,00 
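The splitting step described above could be sketched as follows. This is only an illustration under assumptions: the helper name `split_at` is hypothetical, each row of the aligned list is assumed to use `''` for a missing column, and a cell is assumed to run from its start position up to the next start position present in that row.

```python
def split_at(line, starts):
    """Cut one fixed-width line into cells.

    `starts` has one entry per column: the start offset of the cell,
    or '' when that column is missing in this line.
    """
    # Keep only the columns present in this line, remembering their column index.
    present = [(col, pos) for col, pos in enumerate(starts) if pos != ""]
    cells = [""] * len(starts)
    for i, (col, pos) in enumerate(present):
        # A cell ends where the next present cell starts, or at the end of the line.
        end = present[i + 1][1] if i + 1 < len(present) else len(line)
        cells[col] = line[pos:end].strip()
    return cells


print(split_at("  name   12  EUR", [2, "", 9, 13]))
# ['name', '', '12', 'EUR']
```

Rows produced this way all have the same length, so they can be written out with the `csv` module or passed to a dataframe constructor.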

1 Answer


My difficulty was handling the case where the nearest column is already filled and the adjacent column is filled too, but I think this is the answer:

import numpy as np

beginings = [[16, 102, 119, 136], [16, 48, 76, 109, 145], [16, 47, 75, 108, 128, 145], [16, 48, 77, 110, 141], [98, 135]]
# Alternative input that reproduces the case where the nearest position was already filled:
# beginings = [[16, 17, 18, 136], [16, 17, 18, 109, 145], [16, 47, 75, 108, 128, 145], [16, 48, 77, 110, 141], [98, 135]]
num_col = max([len(i) for i in beginings])

# Get longest row as a reference, and others will be matched by similarity from longest row.
index_longest_list = max(enumerate(beginings), key=lambda tup: len(tup[1]))[0]


def distance(x, y):
    return abs(x - y)


aligned_list = np.full((len(beginings), num_col), np.nan)
reference = beginings[index_longest_list]

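for row_pos, line in enumerate(beginings):
    for start in line:
        # Find the reference column whose start position is closest to this one.
        distances = [distance(start, ref_start) for ref_start in reference]
        index = int(np.argmin(distances))
        # If that column is already taken in this row, step to a neighbour:
        # right when the new value is larger, left otherwise.
        while not np.isnan(aligned_list[row_pos, index]):
            previous_value = aligned_list[row_pos, index]
            if start > previous_value:
                index += 1
            else:
                index -= 1
            # Stop instead of running past either end of the row.
            if not 0 <= index < num_col:
                raise ValueError(f"no free column for {start} in row {row_pos}")
        aligned_list[row_pos, index] = start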

print(aligned_list)
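The array printed above holds floats, with `nan` in the empty slots, while the question shows integers and `''`. A minimal sketch of that conversion, assuming the rows are passed as nested lists (for example via `aligned_list.tolist()`); the helper name `blank_missing` is made up for illustration:

```python
import math


def blank_missing(rows):
    """Replace NaN with '' and turn the remaining floats back into ints."""
    return [["" if math.isnan(value) else int(value) for value in row] for row in rows]


nan = float("nan")
print(blank_missing([[16.0, nan, nan, 102.0, 119.0, 136.0]]))
# [[16, '', '', 102, 119, 136]]
```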