Extract text from PDF section keeping strings in one line

Question

I have a bunch of PDF files and I need to extract some information from them. The section I need contains the text "Referências" and looks like the picture below:

I tried many text extraction tools for this task, but the problem is that I need each string to stay on a single line. I'm not sure I can explain this correctly, so let's look at an example:

I don't want that: I want that:

I hope this is clear; sorry about my English. Thank you very much.

*"I don't want that: ... I want that: ..."* - But in the pdf we clearly see the former, not the latter, so the connected strings keep in line. Thus, you effectively want to transform the former into the latter, you want to combine certain strings. — mkl, Sep 16 '18 at 13:05
The fact is that the strings wrap onto a new line because of the layout width limit; the "new line key (enter)" was not explicitly pressed to break the strings. I wonder if it's possible to convert to text and keep the strings that are one string in one line. — Wolgan Ens, Sep 16 '18 at 18:50
It's completely irrelevant whether some "new line key (enter)" was pressed or not, because the input of your program is not those key presses. Instead, your input is the PDF, and in that PDF the input string has been split and the parts have been drawn on different lines. In contrast to word processing formats, the original long string does not exist anymore; in the PDF there are only the partial strings, without any machine-readable indication that they somehow belong together. (Unless your PDF is tagged, that is. But you did not mention that it's tagged.) — mkl, Sep 16 '18 at 22:18
Thus, your wish to *"keep the strings that are one string in one line"* does not make sense because the partial strings *are not one string anymore*. — mkl, Sep 16 '18 at 22:21
That's the reason I wrote a QUESTION and wrote I WONDER IF. It's not about whether what I said makes sense or not; I know the file obviously doesn't know about the user's key presses. That was only to give context. Thank you for the answer anyway. — Wolgan Ens, Sep 17 '18 at 15:13
Maybe there is a way to keep concatenating the strings while a line occupies the maximum possible width. It wouldn't be perfect, but I would have a lot less work to do manually. — Wolgan Ens, Sep 17 '18 at 22:27
It indeed is possible to customize some text extraction APIs to try and recognize text lines which form paragraphs. Occupation of the whole line widths is one thing to look for, but there are others, too, e.g. slightly greater vertical gaps between paragraphs, punctuation, and capitalization. — mkl, Sep 18 '18 at 07:04
I see, that would be great. I mentioned whole-line occupation because unfortunately there are some PDF files where the different strings have the same vertical gap. — Wolgan Ens, Sep 18 '18 at 13:03
You probably should indicate the computer language you are going to use to implement that feature to allow more concrete responses. — mkl, Sep 18 '18 at 16:09
I have already used [pdfplumber](https://github.com/jsvine/pdfplumber) to extract some table data, so Python should be fine, but it could really be any language, since we can call a subprogram to do the job and then return to the original program. — Wolgan Ens, Sep 18 '18 at 20:22
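
The heuristic discussed in the comments above (treat a line as wrapped when it reaches the right edge of the text block, and join it with the next line) could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions: the input is a list of word dicts with the keys `text`, `x0`, `x1` and `top`, which is the shape pdfplumber's `page.extract_words()` returns; the names `Line`, `words_to_lines`, `merge_wrapped_lines`, `y_tolerance` and `tolerance` are hypothetical and not part of any library; and the tolerance values are guesses that would need tuning per document. It uses line width only, so it ignores the other signals mentioned (vertical gaps, punctuation, capitalization) and does not handle hyphenated words.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Line:
    text: str
    x0: float   # left edge of the line
    x1: float   # right edge of the line
    top: float  # vertical position of the line


def words_to_lines(words: List[Dict], y_tolerance: float = 3.0) -> List[Line]:
    """Group word dicts into visual lines by similar 'top' values."""
    rows: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: w["top"]):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])

    lines = []
    for row in rows:
        row.sort(key=lambda w: w["x0"])  # left-to-right reading order
        lines.append(Line(
            text=" ".join(w["text"] for w in row),
            x0=row[0]["x0"],
            x1=max(w["x1"] for w in row),
            top=row[0]["top"],
        ))
    return lines


def merge_wrapped_lines(lines: List[Line],
                        right_margin: Optional[float] = None,
                        tolerance: float = 15.0) -> List[str]:
    """Join each line with the next one when it (almost) reaches the margin."""
    if not lines:
        return []
    if right_margin is None:
        # Assumption: the widest line marks the right edge of the text block.
        right_margin = max(line.x1 for line in lines)

    entries: List[str] = []
    current = ""
    previous_was_full = False
    for line in lines:
        if current and previous_was_full:
            current += " " + line.text   # previous line wrapped: continue it
        else:
            if current:
                entries.append(current)  # previous line ended early: new entry
            current = line.text
        previous_was_full = line.x1 >= right_margin - tolerance
    entries.append(current)
    return entries
```

With pdfplumber, the word dicts would presumably come from something like `pdfplumber.open(path).pages[0].extract_words()`, possibly after cropping the page to the "Referências" section; the merged entries are then `merge_wrapped_lines(words_to_lines(words))`. A short final line of one reference followed by a full-width first line of the next is handled correctly, but a reference whose last line happens to fill the whole width would be glued to the following one, so some manual checking would still be needed.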

0 Answers