I have a list of lists containing the start position of each column in an OCR'd table:
[[16, 102, 119, 136],
[16, 48, 76, 109, 145],
[16, 47, 75, 108, 128, 145],
[16, 48, 77, 110, 141],
[98, 135]]
My initial idea is to use the longest list as a reference and align the others to it by similarity. Conceptually it is like a fuzzy join, except that each value must match exactly one reference position (at least one match and at most one match).
How can I get from the irregular input list to this expected output?
[[16, '', '', 102, 119, 136],
[16, 48, 76, 109, '', 145],
[16, 47, 75, 108, 128, 145],
[16, 48, 77, 110, '', 141],
['', '', '', 98, '', 135]]
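A minimal sketch of the idea described above, not a definitive solution. It assumes the longest row is the reference, that column order must be preserved, and that "similarity" means the smallest total absolute distance between a row's values and the reference slots they are assigned to. The names `align`, `positions`, `reference` and `aligned` are made up for illustration. It tries every order-preserving choice of slots, which is fine for a handful of columns but grows quickly with wider tables.

```python
from itertools import combinations

def align(row, reference):
    """Place each value of `row` in exactly one slot of `reference`, keeping order."""
    best_slots, best_cost = None, float("inf")
    # Each combination is an increasing tuple of slot indices,
    # so the left-to-right order of the row is never changed.
    for slots in combinations(range(len(reference)), len(row)):
        cost = sum(abs(value - reference[i]) for value, i in zip(row, slots))
        if cost < best_cost:
            best_slots, best_cost = slots, cost
    aligned = [""] * len(reference)  # '' marks a missing cell
    for value, i in zip(row, best_slots):
        aligned[i] = value
    return aligned

positions = [[16, 102, 119, 136],
             [16, 48, 76, 109, 145],
             [16, 47, 75, 108, 128, 145],
             [16, 48, 77, 110, 141],
             [98, 135]]

# Assumption: the longest row contains every column.
reference = max(positions, key=len)
aligned = [align(row, reference) for row in positions]

for row in aligned:
    print(row)
```

With this cost the first four rows come out as in the expected output. The last row does not: 135 is 7 away from 128 but 10 away from 145, so it lands in the fifth slot rather than the sixth. Getting the sixth slot for that value would need a different similarity measure or a different reference than the one assumed here.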
The overall goal is to load the text below into a dataframe; I am providing it in case someone proposes a different approach. As you can see, it has missing headers and missing cells, so my idea was to split each line at the common column positions and later write the result to a CSV.
Cuentas a la banca INTERES DIVISA EUR
CUENTA CORRIENTE EMPRESAS 0000 0000 000000000000 EUR 0,00 % 0.00
CUTRECUENTA EMPRESAS 0000 0000 000000000000 USD 0.00 % 00.00 00.00
CUENTA CORRIENTE EMPRESAS 0000 0000 000000000000 EUR 0.00% 00 000.00
TOTAL 00 000,00
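For the splitting step mentioned above, a rough sketch of how one line could be cut once its start positions have been aligned. It assumes the positions are character offsets into the line and that '' marks a missing cell. The helper `split_row` and the example line are made up for illustration and are not taken from the sample text.

```python
import csv
import io

def split_row(line, aligned_starts):
    """Cut `line` at its own column starts; '' in `aligned_starts` gives an empty cell."""
    present = [s for s in aligned_starts if s != ""]
    # Each cell runs from its start to the next detected start (or the end of the line).
    end_of = dict(zip(present, present[1:] + [len(line)]))
    return ["" if s == "" else line[s:end_of[s]].strip() for s in aligned_starts]

# Hypothetical fixed-width line with columns starting at offsets 0, 4 and 9.
line = "A".ljust(4) + "bb".ljust(5) + "ccc"
cells = split_row(line, [0, "", 4, 9])  # second column is missing in this row

# Write the cells as one CSV record.
buffer = io.StringIO()
csv.writer(buffer).writerow(cells)
csv_text = buffer.getvalue()
print(cells)
```

The same list of cells could be collected per line and passed to a dataframe constructor instead of the CSV writer.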