Multi-column layout handling with pdfminer pdf2txt.py module

Asked May 27 '13 at 14:52

Active May 27 '13 at 18:20

Viewed 1,118 times

So far I have been using the pdfminer pdf2txt.py module successfully.

However, a problem arises with PDF files formatted in two columns. The module extracts the text into a single column, which leaves many words split at the ends of lines. For example:

and functional properties of cellu-
lar components negatively, both physically and chemically.

*Note that the word is split by the '-' character.

What I want is to customize the command so that words split at the end of a line appear whole and no information is lost. Perhaps by adding a line margin or character margin parameter, or a rule specific to the '-' character so that it is replaced by a backslash?
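One possible workaround, sketched here as an assumption rather than a pdf2txt.py feature, is to post-process the extracted text and rejoin words that were split by a hyphen at a line break. The function name `dehyphenate` and the regular expression are hypothetical, and the rule (join only when the next line starts with a lowercase letter) is a heuristic that will also merge genuine hyphenated compounds that happen to break across lines.

```python
import re

# Heuristic (an assumption, not part of pdfminer): a word character followed
# by '-' at the end of a line, then a lowercase letter at the start of the
# next line, is treated as one word split across two lines.
_HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*([a-z])")


def dehyphenate(text):
    """Rejoin words split by a trailing hyphen at a line break."""
    # Drop the hyphen and the line break, keeping the two letters around them.
    return _HYPHEN_BREAK.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "and functional properties of cellu-\nlar components negatively"
    print(dehyphenate(sample))
```

Applied to the example above, this should turn "cellu-" and "lar" back into "cellular" while leaving other line breaks untouched.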

I would also like to know whether there is a way to loop the command so that it parses a directory full of PDF files, each time generating a separate output text file named after the original.

I am not sure how to do it, though.
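A rough sketch of such a loop, assuming pdf2txt.py is on the PATH and executable, and using its `-o` option to name the output file. The helper names (`output_path_for`, `plan_conversions`, `convert_directory`) and the directory names in the usage line are made up for illustration.

```python
import subprocess
from pathlib import Path


def output_path_for(pdf_path, out_dir):
    """Return the .txt path in out_dir named after the given PDF file."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def plan_conversions(pdf_dir, out_dir):
    """List (pdf, txt) path pairs for every *.pdf file directly in pdf_dir."""
    return [(pdf, output_path_for(pdf, out_dir))
            for pdf in sorted(Path(pdf_dir).glob("*.pdf"))]


def convert_directory(pdf_dir, out_dir, tool="pdf2txt.py"):
    """Run the converter once per PDF, writing one text file per input."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf, txt in plan_conversions(pdf_dir, out_dir):
        # Assumption: `tool` can be invoked directly; on some systems it may
        # need to be run through the Python interpreter instead.
        subprocess.run([tool, "-o", str(txt), str(pdf)], check=True)


if __name__ == "__main__":
    convert_directory("pdfs", "texts")  # hypothetical directory names
```

With `check=True` the loop stops at the first file that fails to convert; removing it would let the remaining files be processed instead.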

edited May 27 '13 at 18:20


0 Answers