Clustering a set of letters into a table by position

Question

I have a set of letters positioned on a plane (for each letter I know the coordinates of its corner points, and strings can be treated as parallelograms). I know that the strings form a table, but I know neither how many rows or columns the table has nor the sizes of the cells. Additionally, the letters and the table have the following properties:

Letters are composed into words
Each word has either a whitespace at the end, or is the last word in its row
Each cell of the table either contains one string (which can consist of multiple words) or is completely empty
The gap between words in a string can be larger or smaller than the width of the whitespace.
The letters in a single table row don't all have the same Y coordinates, but tilt is never too large (i.e. the Y coordinates of every corner of every letter in a given row are higher than the Y coordinates of every corner of every letter in the next row)
A cell never spans two rows
Some cells span multiple columns. In that case the contents of the cell can be put in any of the spanned columns, preferably, but not necessarily, consistently (i.e. always in the leftmost or always in the rightmost spanned column)
Empty cells don't have any letters in them (i.e. there's no "whitespace" letter that can be used to identify empty cells)
The contents of the cells in a column are usually either all left-aligned or all right-aligned, though they don't always share the same leftmost (or rightmost) X coordinate. The contents of some cells don't follow the alignment of their column at all, so both edges of the string sit somewhere in the middle of the cell.
The actual contents of the strings are not helpful in determining the structure of the table.
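
To make the row property concrete, here is a minimal sketch of what it allows (Python only for brevity, since the question is language agnostic). The `Letter` type and `split_into_rows` are hypothetical names, not PdfPig API, and the sketch assumes axis-aligned bounding boxes with Y growing upward. Because every corner in a row is strictly above every corner in the next row, a sweep over vertical extents should never merge two rows. It can still split one row in two if the letters of that row don't overlap chain-wise in Y, so this is only a starting point, not a solution.

```python
from dataclasses import dataclass


@dataclass
class Letter:
    # Hypothetical container: axis-aligned bounding box of one glyph.
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def split_into_rows(letters):
    """Group letters into bands whose vertical extents overlap chain-wise.

    Assumes Y grows upward, so the first band returned is the top row.
    """
    rows = []
    band_bottom = None  # lowest y_min seen in the current band
    for letter in sorted(letters, key=lambda l: l.y_max, reverse=True):
        if band_bottom is None or letter.y_max < band_bottom:
            # Entirely below the current band: start a new row.
            rows.append([letter])
            band_bottom = letter.y_min
        else:
            # Overlaps the current band vertically: same row.
            rows[-1].append(letter)
            band_bottom = min(band_bottom, letter.y_min)
    # Restore left-to-right order inside each row, whatever the draw order was.
    return [sorted(row, key=lambda l: l.x_min) for row in rows]
```

Sorting inside each band by `x_min` also sidesteps the drawing-order problem within a row, but it says nothing yet about where one cell ends and the next begins.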

So the question is, given the set of letters and their coordinates, how can I properly divide them into a table?

For example:

Id	Name	Number 1	Number 2	Number 3
GJ32	Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano	24.5	443423	332.68
G33!!:L	Jane D0~	24 17	44:!4O	.68
**	Bob Sm1th	34,7	I.oo	soo

Suppose the letters form a table like the one above, but I only know the letters themselves and their coordinates, not the size of the table.

Specifically, I'm parsing a PDF file consisting of scans of tables and trying to convert those scans into Excel tables (one Excel sheet per PDF page), but there are a lot of issues with how the PDF file in question is built:

The text is generated by text recognition software, and there are often issues such as missing decimal points and misrecognized symbols (with no reliable 1-to-1 correspondences). As a result, I can't rely on the contents of the strings
The text is not always drawn in reading order; moreover, the drawing order changes from page to page
The words and strings are not always drawn together; often the PDF shows part of a word, adjusts the text matrix, and shows another part of the same word
There are multiple types of tables mixed in the document.
Sometimes a table spans multiple pages, so that only one page actually contains a header row. The alignments of the tables on different pages don't match, so even if I were able to find a header row (which is problematic because of the first issue), I couldn't use its coordinates to properly group the strings on another page

I'm using C# and PdfPig, but the question is language agnostic for the most part.

@KJ the task here is creating a system for automatic parsing, any solution that requires human input isn't available. Furthermore we aren't allowed to install any unauthorized software, and we don't have authorizations for any PDF editors — JohnDiGriz, Mar 20 '23 at 22:11
