I want to classify and analyze the chapters and subchapters of a book in PDF format. Specifically, I want to count the number of words and examine how often each word occurs and in which chapter.
pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader
# Creating a pdf file object
pdf = open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf',"rb")
# creating pdf reader object
pdf_reader = PyPDF2.PdfFileReader(pdf)
# checking number of pages in a pdf file
print(pdf_reader.numPages)
print(pdf_reader.getDocumentInfo())
# creating a page object
page = pdf_reader.getPage(0)
# finally extracting text from the page
print(page.extractText())
# Extracting entire PDF
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    # 1-based page number, as a string
    a = str(1 + pdf_reader.getPageNumber(page))
    print(a)
    page_content = page.extractText()
    print(page_content)
# closing the pdf file
pdf.close()
This code already works. Now I want to do more analysis, such as:
- Store each chapter in its own variable and count the number of words. In the end, everything should be stored in an Excel file.
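A minimal sketch of what this could look like, under a few assumptions that are not part of the code above: the text of all pages has been joined into one string, every chapter starts on its own line with "Chapter <number>", and pandas plus openpyxl are installed for the Excel export. The names `CHAPTER_HEADING`, `split_into_chapters`, `count_words`, `build_rows` and `save_to_excel` are hypothetical helpers made up for illustration. Instead of one variable per chapter, the chapters go into a dict keyed by heading.

```python
import re
from collections import Counter

# Assumption: each chapter starts on its own line with "Chapter <number>".
# Adjust this pattern to the headings actually used in the book.
CHAPTER_HEADING = re.compile(r"^Chapter\s+\d+.*$", re.MULTILINE)


def split_into_chapters(full_text):
    """Return a dict mapping each chapter heading to the text that follows it."""
    matches = list(CHAPTER_HEADING.finditer(full_text))
    chapters = {}
    for index, match in enumerate(matches):
        start = match.end()
        # A chapter ends where the next heading begins, or at the end of the text.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(full_text)
        chapters[match.group().strip()] = full_text[start:end]
    return chapters


def count_words(text):
    """Count how often each word occurs, ignoring case."""
    return Counter(re.findall(r"\w+", text.lower()))


def build_rows(chapters):
    """Build one row per (chapter, word) pair, ready to be written as a table."""
    rows = []
    for heading, text in chapters.items():
        for word, count in count_words(text).most_common():
            rows.append({"chapter": heading, "word": word, "count": count})
    return rows


def save_to_excel(rows, path):
    """Write the rows to an .xlsx file (assumes pandas and openpyxl are installed)."""
    import pandas as pd
    pd.DataFrame(rows).to_excel(path, index=False)


# Small made-up sample text standing in for the extracted PDF text.
sample = "Chapter 1 Intro\nthe cat and the dog\nChapter 2 More\na dog"
chapters = split_into_chapters(sample)
rows = build_rows(chapters)
total_words = {heading: sum(count_words(text).values())
               for heading, text in chapters.items()}
print(total_words)
```

With the real book, the page texts from the extraction loop could be collected in a list and joined into a single string that is passed to `split_into_chapters`, followed by something like `save_to_excel(rows, "word_counts.xlsx")`. Whether the heading pattern matches depends on how `extractText()` renders the headings, so it will probably need tuning, and subchapters would need a second pattern of their own.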