tabula library for extracting text from PDF in Python by area

Question

I am trying tabula and selecting text by area, but some areas shift between documents, so I get mismatched results. See the images for a clearer explanation.

[Images: a document with a large "Discriminação dos Serviços" section, and a document with a small "Discriminação dos Serviços" section]

What are the alternatives for handling this kind of behavior in PDF files?

score 1 · Answer 1 · answered Jun 08 '22 at 17:23

If it's only two different sizes, maintain two sets of location data, and have a single piece of text you look for that tells you which size it is, like:

Código do Serviço / Atividade

(I picked that text because, when looking at them side by side, it's the first text I could identify that had different locations.)

If the "lower" location matches, then it's the bigger of the two, and you will use the "large" location set.
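
As a minimal sketch of that idea, assuming only two layouts: the coordinates below are made-up placeholders, and `extract_text` is a hypothetical callable that would wrap whatever area-extraction call you already use (for example tabula's `read_pdf` with its `area` argument) and return the text found inside a `(top, left, bottom, right)` box.

```python
ANCHOR_TEXT = "Código do Serviço / Atividade"

# Placeholder coordinates (top, left, bottom, right) in points; measure your own.
LAYOUTS = {
    "large": {
        "anchor_area": (420, 30, 440, 300),
        "fields": {"service_code": (440, 30, 470, 560)},
    },
    "small": {
        "anchor_area": (330, 30, 350, 300),
        "fields": {"service_code": (350, 30, 380, 560)},
    },
}


def detect_layout(extract_text, layouts=LAYOUTS, anchor_text=ANCHOR_TEXT):
    """Return the name of the first layout whose anchor area contains anchor_text.

    extract_text is an assumed helper: it takes an area tuple and returns a string.
    """
    for name, layout in layouts.items():
        if anchor_text in extract_text(layout["anchor_area"]):
            return name
    return None  # no known layout matched
```

Once `detect_layout` returns a name, the matching `fields` entry gives the areas to extract for that document.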

Zach Young


There are multiple sizes. I thought about running a for loop that varies the vertical position, but that would slow processing down. Any tips or suggestions? – Lucas Dadalt Jun 08 '22 at 19:28
Hmm. At least in the large vs. small sample you shared, the only difference is that one section, _Código do Serviço / Atividade_. So, can you use fixed locations for the top portion, find _Código..._, then make the locations in the bottom portion relative to (offset from) _Código..._? Can you extend that idea to the other sizes? – Zach Young Jun 08 '22 at 23:29
Yes. The area where "Discriminação dos Serviços" begins is always the same. I will start from there and loop vertically until I reach "Código do Serviço / Atividade". When I find that section, I will save its position and locate the other sections from there. – Lucas Dadalt Jun 10 '22 at 23:21
This approach will be slower, which is why I was wondering whether any other solution exists. – Lucas Dadalt Jun 10 '22 at 23:21
I just answered another question about finding text in a PDF, comparing PDFQuery (based on pdfminer) and PyMuPDF. It's not quite the same problem, but you might want to take a look at the PyMuPDF code, because that shows you how to very quickly iterate through all the text objects. – Zach Young Jun 10 '22 at 23:31
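
The relative-offset idea from these comments can be sketched without looping over vertical positions: read the page's words once, find where _Código do Serviço / Atividade_ starts, and shift the lower areas by that amount. This is only a sketch under assumptions: the word tuples are shaped like those returned by PyMuPDF's `page.get_text("words")` (`x0, y0, x1, y1, text, ...`), the words of the phrase appear consecutively in that list, and the helper names and offsets are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Area = Tuple[float, float, float, float]  # (top, left, bottom, right)


def find_phrase_top(words: Sequence[tuple], phrase: str) -> Optional[float]:
    """Return y0 of the first run of consecutive words that spells out phrase.

    Each word is assumed to be (x0, y0, x1, y1, text, ...).
    """
    target = phrase.split()
    texts = [w[4] for w in words]
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            return words[i][1]
    return None  # phrase not found on this page


def shift_area(relative_area: Area, anchor_top: float) -> Area:
    """Turn an area measured from the anchor into absolute page coordinates."""
    top, left, bottom, right = relative_area
    return (anchor_top + top, left, anchor_top + bottom, right)
```

With the anchor position known, each lower area is stored once as an offset from the anchor and passed through `shift_area`, so the same offsets would serve every document size. It is worth checking that the extraction tool you pass the result to measures coordinates from the same origin and in the same units.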
