
I have PDF text data that is read using pdftotext in Python.

How can I convert this data into correctly sequenced text, so that I can extract the text from the string sequentially? I want to convert this two-column data into single-column data.

Example of the text:

  1.  With reference to Stone Age, consider the       4.   With reference to Vedic Age, consider the
     following statements:                                following statements:
     1. Microliths are tiny stone artifacts               1. The Aranyakas deal with mysticism,
           belonging to Middle Stone Age.                       rites, rituals and sacrifices.
     2. The use of bow and arrow began during             2. Child marriage and practice of sati was
           the Old Stone Age                                    prevelant during the Rig Vedic Period.
     3. Lakhudiyar caves of Uttrakhand bear               3. Nishka,Satamana and Krishnala were
           the famous pre-historic cave paintings               types of coins used as medium of
           of wavy lines and hand-linked dancing                exchange.
           figures                                        Which of the statements given above are
                                                          correct?
     Which of the statements given above are
                                                          (a) 1 and 2 only
     correct?
                                                          (b) 2 and 3 only
     (a) 1 and 2 only
                                                          (c) 1 and 3 only
     (b) 2 and 3 only
                                                          (d) 1,2 and 3
     (c) 1 and 3 only
     (d) 1, 2 and 3
    

Below is the code to read the PDF.

import pdftotext

def extract_text_from_pdf(pdf_path):
    # Load the PDF; the returned object yields one string per page
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    return pdf
Granth
  • I am unable to get "extract left" and "extract right" to work. Can you please share the syntax for any of the solutions? – Granth Aug 30 '23 at 08:16

2 Answers


As there is no standard sample, I have used https://www.drishtiias.com/images/pdf/February%202022%20(Part-I).pdf, which has a fairly good mix of issues to consider.


We do not need page 1, and pages 2 to 9 are split into left and right columns.

Thus we need to pull each side in turn, but without the top line.

On Windows we can write a script that uses the cross-platform pdftotext to iterate over both sides in turn and collate the pages, like this.

You can do something similar in any scripting language or OS shell. This Windows script was designed for drag and drop, or for calling from the command line as
script.cmd "path/filename.pdf". Once you are happy with the result, add @echo off as the first line.

rem Start the output file with a blank line
echo/ >"%~dpn1-out.txt"
for /l %%c in (2,1,9) do (
echo Page %%c >>"%~dpn1-out.txt"
rem Left column: crop box starting at x 0, skipping the top 20 units
%~dp0\pdftotext -nopgbrk -layout -x 0 -y 20 -W 300 -H 820 -f %%c -l %%c -enc UTF-8 "%~f1" "%~dpn1-temp.txt"
copy /b "%~dpn1-out.txt"+"%~dpn1-temp.txt" "%~dpn1-left.txt"
echo/ >>"%~dpn1-left.txt"
rem Right column: same crop box shifted to x 300
%~dp0\pdftotext -layout -x 300 -y 20 -W 300 -H 820 -f %%c -l %%c -enc UTF-8 "%~f1" "%~dpn1-temp.txt"
copy /b "%~dpn1-left.txt"+"%~dpn1-temp.txt" "%~dpn1-out.txt"
echo/ >>"%~dpn1-out.txt"
)
rem Remove the intermediate files
del /q "%~dpn1-temp.txt" & del /q "%~dpn1-left.txt"
pause

The result will be a well laid-out text stream in a single column. If you do not want the page breaks, add the -nopgbrk switch to the second extraction as well, the same as in the first.
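Since the question is about Python, the same idea can be sketched there by calling the pdftotext command-line tool through subprocess. This is only a rough sketch, not a drop-in replacement for the script: the function names crop_command and extract_columns are made up for illustration, the page range and crop values are simply the ones used above for this particular sample PDF, and it assumes pdftotext is on the PATH.

```python
import subprocess


def crop_command(pdf_path, page, x, y, width, height, exe="pdftotext"):
    """Build the pdftotext argument list for one cropped region of one page."""
    return [
        exe, "-layout", "-nopgbrk", "-enc", "UTF-8",
        "-f", str(page), "-l", str(page),          # restrict to a single page
        "-x", str(x), "-y", str(y),                # top-left corner of the crop box
        "-W", str(width), "-H", str(height),       # size of the crop box
        pdf_path, "-",                             # "-" sends the text to stdout
    ]


def extract_columns(pdf_path, first_page=2, last_page=9,
                    split_x=300, top=20, height=820):
    """Collect left then right column text for each page (values are sample-specific)."""
    parts = []
    for page in range(first_page, last_page + 1):
        for x in (0, split_x):                     # left column first, then right
            cmd = crop_command(pdf_path, page, x, top, split_x, height)
            result = subprocess.run(cmd, capture_output=True, check=True)
            parts.append(result.stdout.decode("utf-8"))
    return "\n".join(parts)
```

Calling extract_columns("path/filename.pdf") would return the collated text as one string. The crop values would need adjusting for a PDF with a different page size or column position.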


K J
  • How do I determine the midpoint, i.e. the -x 300 -y 20 -W 300 -H 820 values, for each PDF page? I am getting mixed output because the midpoint is different on each page. However, pdftotext is able to read every page correctly. The only task is to find the midpoint efficiently. – Granth Aug 30 '23 at 15:45

Read the file with Python pdftotext, then split the text into lines and remove trailing spaces and tabs.

Then find max_length, the length of the longest line produced by the split above. The midpoint, as a Python index, is int((max_length+1)/2).

For each line, take the left and right parts around the page midpoint computed above. Finally, append all the left parts, followed by all the right parts, to the final output text.
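The steps above can be sketched roughly as follows. The function names split_two_columns and pdf_to_single_column are only illustrative, and the sketch assumes the page text keeps the physical two-column layout; depending on the installed pdftotext version, that may require passing physical=True to pdftotext.PDF.

```python
def split_two_columns(page_text):
    """Turn one page of two-column layout text into single-column text."""
    # Strip trailing spaces/tabs so padding does not inflate the line length.
    lines = [line.rstrip(" \t") for line in page_text.splitlines()]
    max_length = max((len(line) for line in lines), default=0)
    mid = int((max_length + 1) / 2)  # midpoint index as described above

    left, right = [], []
    for line in lines:
        left_part = line[:mid].rstrip()
        right_part = line[mid:].strip()
        if left_part.strip():        # skip rows where the left column is empty
            left.append(left_part)
        if right_part:               # skip rows where the right column is empty
            right.append(right_part)
    # All of the left column first, then all of the right column.
    return "\n".join(left + right)


def pdf_to_single_column(pdf_path):
    """Apply the split to every page of a PDF (needs the pdftotext package)."""
    import pdftotext  # imported here so split_two_columns works without it

    with open(pdf_path, "rb") as f:
        pages = list(pdftotext.PDF(f))
    return "\n".join(split_two_columns(page) for page in pages)
```

Note that the midpoint is only an estimate of where the gutter is: if the left column is wider than half of the longest line on a page, text will be cut in the wrong place, so it is worth checking the output page by page.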

Granth