I have text extracted from a PDF using the pdftotext library in Python.
The PDF has a two-column layout, and the extracted text mixes lines from both columns. How can I convert this two-column text into single-column text in the correct reading order, so that I can process the string sequentially?
Example of the extracted text:
With reference to Stone Age, consider the 4. With reference to Vedic Age, consider the following statements: following statements: 1. Microliths are tiny stone artifacts 1. The Aranyakas deal with mysticism, belonging to Middle Stone Age. rites, rituals and sacrifices. 2. The use of bow and arrow began during 2. Child marriage and practice of sati was the Old Stone Age prevelant during the Rig Vedic Period. 3. Lakhudiyar caves of Uttrakhand bear 3. Nishka,Satamana and Krishnala were the famous pre-historic cave paintings types of coins used as medium of of wavy lines and hand-linked dancing exchange. figures Which of the statements given above are correct? Which of the statements given above are (a) 1 and 2 only correct? (b) 2 and 3 only (a) 1 and 2 only (c) 1 and 3 only (b) 2 and 3 only (d) 1,2 and 3 (c) 1 and 3 only (d) 1, 2 and 3
Below is the code I use to read the PDF:
import pdftotext

def extract_text_from_pdf(pdf_path):
    # Load the PDF; the returned pdftotext.PDF object can be iterated page by page
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    return pdf
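For reference, here is a minimal sketch of the kind of transformation I am after. It assumes each page string keeps the two columns side by side on every line, separated by a run of spaces (I believe the library's `physical=True` option may help produce such output, but I have not confirmed that). The helper names `find_gutter` and `two_columns_to_one` and the `min_gutter` parameter are hypothetical, not part of pdftotext. The idea is to find the widest run of character columns that is blank on every line near the middle of the page, split each line there, and emit the left halves followed by the right halves.

```python
def find_gutter(lines, width, min_gutter=2):
    """Return a split position inside the widest all-blank column run
    in the central half of the page, or None if no run is wide enough."""
    # blank[i] stays True only if no line has a visible character at column i
    blank = [True] * width
    for line in lines:
        for i, ch in enumerate(line):
            if not ch.isspace():
                blank[i] = False

    lo, hi = width // 4, (3 * width) // 4
    best_start, best_len, run_start = 0, 0, None
    for i in range(lo, hi + 1):  # i == hi is a sentinel that closes an open run
        is_blank = i < hi and blank[i]
        if is_blank and run_start is None:
            run_start = i
        elif not is_blank and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_len < min_gutter:
        return None
    return best_start + best_len // 2


def two_columns_to_one(page_text, min_gutter=2):
    """Reorder one page: all left-column lines first, then all right-column lines."""
    lines = [line.rstrip() for line in page_text.splitlines()]
    width = max((len(line) for line in lines), default=0)
    split = find_gutter(lines, width, min_gutter) if width else None
    if split is None:
        return page_text  # no gutter found: assume the page is single-column
    left = [line[:split].strip() for line in lines]
    right = [line[split:].strip() for line in lines]
    return "\n".join(part for part in left + right if part)


# Hypothetical layout-preserved sample built from the first lines of my text
rows = [
    ("With reference to Stone Age, consider the", "4. With reference to Vedic Age, consider the"),
    ("following statements:", "following statements:"),
    ("1. Microliths are tiny stone artifacts", "1. The Aranyakas deal with mysticism,"),
    ("belonging to Middle Stone Age.", "rites, rituals and sacrifices."),
]
page = "\n".join(f"{left:<48}{right}" for left, right in rows)
print(two_columns_to_one(page))
```

With real output this would be applied to each page string in turn, for example `"\n".join(two_columns_to_one(p) for p in pdf)`. Pages with full-width headings, uneven columns, or text that crosses the gutter would likely need a more careful approach.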