I am trying to convert a very clean PDF file into a .txt file using Python. I have tried PyPDF2 and PDFMiner; both recognized the text perfectly.

However, because the lines are wrapped in the PDF, the extracted .txt file has unintended line breaks at the ends of lines. For example, line 1 comes out as "is an account of the Elder \n Days, ". There should not be a line break between "Elder" and "Days".

[Screenshot: the extracted txt file]

[Screenshot: the PDF file]

When the file is edited in Acrobat, it can clearly be seen that the original text in the PDF contains no hard line breaks and can be edited as a paragraph rather than as single lines.

The code I have tried (adapted from an answer to "convert from pdf to text: lines and words are broken"):

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


# Converts a PDF and returns its text content as a string.
def convert(fname, pages=None):
    # An empty set makes PDFPage.get_pages process every page.
    pagenums = set(pages) if pages else set()

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # The with block closes the input file even if a page fails to parse.
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text


# Raw strings keep the backslashes in Windows paths literal.
path = r'D:\Folder\File.pdf'
text = convert(path)
with open(r'D:\Folder\File.txt', 'a', encoding='utf-8') as f:
    f.write(text)
C.Ann.Sng
  • 1
    The .pdf file itself is formatted that way? In line 1 for example, you can clearly see the line break from "Elder" to "Days". – J. M. Arnold May 26 '21 at 15:50
  • Yes, the .pdf file is presented this way because any given paragraph needs to wrap somewhere instead of appearing as one very long line. However, when I try to edit it in Acrobat, it actually comes back as a paragraph instead of single lines. Hence I'm pretty sure the line is just wrapped, not broken. – C.Ann.Sng May 26 '21 at 15:56
  • 1
    Can you simply strip single line breaks as it appears as though you want to keep double line breaks? – JonSG May 26 '21 at 16:05
  • @JonSG Thanks for commenting. There are other single line breaks in the doc that I want to keep, so I'm looking for a universal solution. I also saw many discussions online about how to extract text from a PDF without the line breaks, and it seems to be a long-standing problem for many, so I think it would be of interest to others too. There is software that offers this feature, but I'm wondering whether it can be achieved in Python. – C.Ann.Sng May 26 '21 at 16:12
  • I bet we can do something. Python aside, how would you determine when to keep or discard a line break? – JonSG May 26 '21 at 16:15
  • @JonSG So let's say I created two Word docs: A.doc, with a long paragraph that only appears as multiple lines because of line wrapping, and B.doc, where I hard-broke the paragraph into separate lines by pressing 'enter' (which gives me a line break). After that I saved both A and B as PDF. I am hoping to extract the text and get the same thing back from A.pdf and B.pdf: A as a paragraph and B as multiple lines. I've seen one solution online, which is to export the PDF from Acrobat Pro to HTML so that there are no line breaks in A.htm. But I'm wondering if it can be done in Python, as not all PCs have Acrobat Pro. – C.Ann.Sng May 26 '21 at 16:29
  • But aren't single line breaks in the original document equal to two line breaks in the output? And the end of a normal wrapped line becomes a single line break (the ones you'd like to remove). So stripping one line break per line should solve the issue, shouldn't it? – J. M. Arnold May 26 '21 at 17:33
  • @J.M.Arnold. I understand what you mean. I've only shown a small portion of a 180+ page PDF. There are portions that don't follow this "one line break in the original document equals two line breaks" pattern, where one line break in the original might come out as one or more line breaks, as they have different spacing between paragraphs. – C.Ann.Sng May 26 '21 at 17:45
  • 1
    You are misdiagnosing what Acrobat is doing. A PDF file is just a capture of a printout. It does not have the concept of paragraphs and sentences, nor does it do any wordwrapping. It's just "print this string at this X,Y coordinate", for each line. There definitely ARE line breaks at the end of each line. Acrobat is just guessing that those lines make up a paragraph because of the positioning. You need to do the same thing, manually. – Tim Roberts May 26 '21 at 23:55

1 Answer

"A picture is worth a thousand words," and comments do not allow pictures! I am using the Web archive of a different copy, but the gist is exactly the same.

You are working with "justified" content, but unlike reflowing justification in a word processor, the glyphs in a PDF are generally tied to a line at a set position up from the page base. Adobe is working on reflowable PDFs and has the expertise to combine lines into a paragraph; other readers, however, will accept each line for what it is.

There are no paragraph boundary markers like there are in, say, HTML (<p> ... </p>).

Readers could in the future be augmented, like Acrobat, to combine the lines, but it's not needed for reading (aloud) one line at a time. Some audio readers will noticeably stutter on those "line at a time" returns, whilst others are intelligently programmed to simply ignore them.
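
As a rough illustration of how lines might be combined from their positions, here is a minimal sketch. It assumes pdfminer.six is installed for the PDF part, and that a line whose right edge reaches the right margin was wrapped while a noticeably shorter line ends a paragraph. The function names and the 5-point tolerance are my own choices for illustration, not anything Acrobat or pdfminer defines.

```python
def join_by_right_edge(lines, tolerance=5.0):
    """Join wrapped lines using their right-edge coordinate.

    lines: list of (text, x1) pairs in reading order, where x1 is the
    right edge of the line in PDF points.
    Assumption: a line ending within `tolerance` points of the widest
    line was wrapped; a shorter line is treated as a hard break.
    """
    lines = [(text.strip(), x1) for text, x1 in lines if text.strip()]
    if not lines:
        return ""
    right_margin = max(x1 for _, x1 in lines)
    pieces = []
    for text, x1 in lines:
        if x1 >= right_margin - tolerance:
            pieces.append(text + " ")   # reaches the margin: soft break
        else:
            pieces.append(text + "\n")  # stops short: keep the break
    return "".join(pieces).rstrip()


def lines_from_pdf(path):
    """Collect (text, x1) for every text line, using pdfminer.six."""
    # Imported lazily so join_by_right_edge works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    lines = []
    for page_layout in extract_pages(path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        lines.append((line.get_text(), line.x1))
    return lines
```

Something like `join_by_right_edge(lines_from_pdf(path))` would then return the rejoined text. Taking one right margin for the whole document is a simplification: pages with columns, indented blocks, or centered headings would need the margin computed per text box or per page.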

In short, you need to add your own AI/regex to gather lines and add indents, but beware of significant differences in human literature, such as hyphenation and oriental punctuation.
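
To give an idea of what such a gathering step might look like on the already extracted text, here is a minimal sketch in plain Python. The heuristic is an assumption of mine, not a standard: in justified text a wrapped line is close to the full width, so a line at least 80% as long as the longest line is joined to the next one, and anything shorter keeps its break. The name `unwrap` and the `slack` parameter are hypothetical.

```python
import re


def unwrap(text, slack=0.2):
    """Join lines that look wrapped; keep breaks after short lines."""
    lines = [line.rstrip() for line in text.split("\n")]
    width = max((len(line) for line in lines), default=0)
    pieces = []
    for i, line in enumerate(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        is_full = len(line) >= width * (1 - slack)
        if line and nxt and is_full:
            if re.search(r"[A-Za-z]-$", line):
                # Assumed to be a word split across lines. This is wrong
                # for genuine compounds such as "well-known".
                pieces.append(line[:-1])
            else:
                pieces.append(line + " ")
        else:
            # Short line, blank line, or last line: keep the break.
            pieces.append(line + "\n")
    return "".join(pieces).rstrip("\n")
```

It could be applied to the string returned by the question's convert() before writing the file. Character counts are only a proxy for line width in a proportional font, so `slack` needs tuning per document, and the hyphen rule is exactly the kind of case to check by hand.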

K J