-1

I am trying to extract text from a PDF file that contains text, tables, and images, and I want to save the output on my local system. This is the code I have so far:

from PyPDF2 import PdfFileReader
# Load the pdf to the PdfFileReader object with default settings
with open("SHKelkar.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    total_pages = pdf_reader.numPages
    print(total_pages)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
    for i in range(total_pages):
        page = pdf_file.page[i]
        textdata = page.extract_text()
        print(textdata)
netha

1 Answer

0

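You are trying to get the page from pdf_file (the raw file handle) instead of pdf_reader. The file object has no page attribute, so the line page = pdf_file.page[i] fails before any text is extracted.

The code below works for me: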

from PyPDF2 import PdfFileReader
# Load the pdf to the PdfFileReader object with default settings
with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    total_pages = pdf_reader.getNumPages()
    print(total_pages)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
    for i in range(total_pages):
        page = pdf_reader.getPage(i)
        textdata = page.extractText()
        print(textdata)
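
Since the question also asks about saving the result locally, here is a minimal sketch of that step. It assumes PyPDF2 2.0 or later, where the reader class is PdfReader and pages expose extract_text() (PdfFileReader, getPage and extractText were removed in PyPDF2 3.0). The function names extract_text_to_file and write_pages, and the output file name, are made up for illustration. Note that extract_text() only returns text: table cells come out as plain text without their layout, and images are not extracted.

```python
from pathlib import Path


def write_pages(pages, out_path):
    """Write the text of each page to out_path, one page after another.

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader(...).pages. Returns the number of pages written.
    """
    # extract_text() may return None or "" for pages with no text layer.
    texts = [(page.extract_text() or "") for page in pages]
    Path(out_path).write_text("\n".join(texts), encoding="utf-8")
    return len(texts)


def extract_text_to_file(pdf_path, out_path):
    """Read pdf_path with PyPDF2 and save its text to out_path."""
    # Imported here so write_pages can be used without PyPDF2 installed.
    # Assumes PyPDF2 >= 2.0; the successor package exposes the same
    # class as `from pypdf import PdfReader`.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return write_pages(reader.pages, out_path)
```

Calling extract_text_to_file("sample.pdf", "sample.txt") would then write the text of every page to sample.txt in the current directory.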
Tasnuva Leeya
  • How can I create a CDQA dataset? For example, I have extracted text from a PDF. How can I build a CDQA-style dataset from that extracted text? – netha Nov 02 '20 at 17:18
  • Did the above answer solve your problem? For CDQA, you can check this blog post: https://towardsdatascience.com/how-to-create-your-own-question-answering-system-easily-with-python-2ef8abc8eb5 – Tasnuva Leeya Nov 04 '20 at 11:25