I have a set of letters positioned on a plane (for each letter I know the coordinates of its corner points, and strings can be treated as parallelograms). I know that the strings form a table, but I know neither how many rows or columns the table has nor the sizes of the cells. Additionally, the letters and the table have the following properties:
- Letters are composed into words
- Each word has either a whitespace at the end, or is the last word in its row
- Each cell of the table either contains one string (which can consist of multiple words) or is completely empty
- The gap between words in a string can be larger or smaller than the width of the whitespace.
- The letters in a single table row don't all have the same Y coordinates, but tilt is never too large (i.e. the Y coordinates of every corner of every letter in a given row are higher than the Y coordinates of every corner of every letter in the next row)
- A cell never spans multiple rows
- Some cells span multiple columns. In that case the contents of the cell can be put in any of the spanned columns, preferably, but not necessarily, consistently (i.e. always in the leftmost or always in the rightmost spanned column)
- Empty cells don't have any letters in them (i.e. there's no "whitespace" letter that can be used to identify empty cells)
- The contents of the cells in a column are usually either all left-aligned or all right-aligned, though they don't always share exactly the same leftmost (or rightmost) X coordinate. The contents of some cells don't follow their column's alignment at all, so both edges of the string are somewhere in the middle of the cell.
- The actual contents of the strings are not helpful in determining the structure of the table.
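To make the input concrete, here is a minimal sketch of the part that seems tractable: splitting letters into rows using only the Y-separation property above. It is in Python rather than C# because the problem is language agnostic. The `Letter` type and the `group_into_rows` function are hypothetical names made up for illustration, not PdfPig API, and the sketch assumes Y grows upward. It also assumes that the Y extents of the letters within one row overlap in a chain. Because rows never overlap in Y, this can never merge two rows, but a row containing only small glyphs at different heights (say `.` and `'`) could be split in two.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), with y growing upward (assumption)


@dataclass
class Letter:
    """Hypothetical stand-in for one recognized glyph."""
    char: str
    corners: List[Point]  # the four corner points of the glyph

    @property
    def y_min(self) -> float:
        return min(y for _, y in self.corners)

    @property
    def y_max(self) -> float:
        return max(y for _, y in self.corners)

    @property
    def x_min(self) -> float:
        return min(x for x, _ in self.corners)


def group_into_rows(letters: List[Letter]) -> List[List[Letter]]:
    """Split letters into rows, top row first, each row sorted left to right.

    Letters are visited from the highest top edge downward. A letter joins
    the current row if its Y extent overlaps the band covered by that row
    so far; otherwise it starts a new row.
    """
    rows: List[List[Letter]] = []
    row_floor = 0.0  # lowest Y seen in the current row
    for letter in sorted(letters, key=lambda l: l.y_max, reverse=True):
        if rows and letter.y_max > row_floor:
            rows[-1].append(letter)
            row_floor = min(row_floor, letter.y_min)
        else:
            rows.append([letter])
            row_floor = letter.y_min
    for row in rows:
        row.sort(key=lambda l: l.x_min)
    return rows
```

This only gives me rows of loose letters. Grouping the letters of each row into cells and columns is the part I can't work out.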
So the question is, given the set of letters and their coordinates, how can I properly divide them into a table?
For example:
Id | Name | Number 1 | Number 2 | Number 3 |
---|---|---|---|---|
GJ32 | Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano | 24.5 | 443423 | 332.68 |
G33!!:L | Jane D0~ | 24 17 | 44:!4O | .68 |
** | Bob Sm1th | 34,7 | I.oo | soo |
Let's say the letters form a table like this one, but I only know the letters themselves and their coordinates, not the dimensions of the table.
Specifically, I'm parsing a PDF file consisting of scans of tables and trying to convert those scans into Excel tables (one Excel sheet per PDF page), but there are a lot of issues with how the PDF file in question is built:
- The text is generated by text recognition software, and there are often issues such as missing decimal points and misrecognized symbols (with no reliable 1-to-1 correspondences). As a result, I can't rely on the contents of the strings.
- The text is not always drawn in reading order. Moreover, the order in which it's drawn changes from page to page.
- Words and strings are not always drawn in one piece. Often the PDF shows part of a word, adjusts the text matrix, and then shows another part of the same word.
- There are multiple types of tables mixed in the document.
- Sometimes a table spans multiple pages, so that only one page actually contains a header row. The alignments of the tables on different pages don't match, so even if I were able to find a header row (which is problematic because of the first issue), I couldn't use its coordinates to properly group the strings on another page.
I'm using C# and PdfPig, but the question is language agnostic for the most part.