So far I have been using the pdf2txt.py script from pdfminer with success.
However, a problem arises with PDF files formatted in two columns. The script extracts the text into a single column, which leaves many words split at the ends of lines. Example:
and functional properties of cellu-
lar components negatively, both physically and chemically.
*Note that the two halves of the split word are separated by the '-' character.
What I want is to customize the command so that words split at the end of a line appear whole and no information is lost. Perhaps by adding a line or character margin parameter, or by having the '-' character specifically replaced by a backslash?
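One idea I had, as a rough sketch only, is to leave the command alone and clean up the extracted text afterwards. The function name `join_hyphenated` is my own and not part of pdfminer, and the rule it applies (a '-' at the end of a line, with lowercase letters on both sides, is a split word) is an assumption that may not hold for every document:

```python
import re

# A '-' that ends a line, preceded by a lowercase letter and followed
# (on the next line) by another lowercase letter, is treated as a split word.
# Assumption: this heuristic is good enough for plain running text.
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-[ \t]*\r?\n[ \t]*(?=[a-z])")


def join_hyphenated(text):
    """Remove end-of-line hyphens and the line break that follows them."""
    return _HYPHEN_BREAK.sub("", text)


sample = "and functional properties of cellu-\nlar components negatively"
print(join_hyphenated(sample))
```

The drawback I can see is that a genuinely hyphenated compound that happens to break at the end of a line (for example "well-" followed by "known") would also be merged into one word.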
I would also like to know if there is a way to loop the command so that it parses a directory full of PDF files, generating a separate output text file for each one, named after the original.
I am not sure how to do it, though.
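For the batch part, this is roughly what I imagine, as an untested sketch. It assumes that pdf2txt.py is on the PATH and can be called directly with `-o OUTPUT INPUT`. The function names `plan_outputs` and `convert_all` are made up for illustration:

```python
import subprocess
from pathlib import Path


def plan_outputs(pdf_dir, out_dir):
    """Pair every *.pdf in pdf_dir with a .txt path of the same name in out_dir."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    return [(pdf, out_dir / (pdf.stem + ".txt"))
            for pdf in sorted(pdf_dir.glob("*.pdf"))]


def convert_all(pdf_dir, out_dir, command="pdf2txt.py"):
    """Run the command once per PDF, writing each result to its own text file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf, txt in plan_outputs(pdf_dir, out_dir):
        # Assumption: `command` is executable from the shell as-is.
        # On some systems it may need to be run through the Python interpreter.
        subprocess.run([command, "-o", str(txt), str(pdf)], check=True)
```

With hypothetical directory names, calling `convert_all("papers", "papers_txt")` would then be expected to produce `papers_txt/foo.txt` for each `papers/foo.pdf`.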