
I want to classify and analyze the chapters and subchapters of a book in PDF format. That means counting the number of words and examining how often each word occurs and in which chapter.

pip install PyPDF2

import PyPDF2

# Uses the legacy PyPDF2 API (PdfFileReader, getPage, extractText), i.e. PyPDF2 < 3.0
# open the PDF in binary mode; the with-block closes the file at the end
with open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf', "rb") as pdf:
    # creating pdf reader object
    pdf_reader = PyPDF2.PdfFileReader(pdf)
    # checking number of pages and document metadata
    print(pdf_reader.numPages)
    print(pdf_reader.getDocumentInfo())
    # creating a page object for the first page
    page = pdf_reader.getPage(0)
    # extracting text from that page
    print(page.extractText())
    # extracting the entire PDF, page by page
    for i in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(i)
        # getPageNumber is zero-based, so add 1 for a human-readable page number
        page_number = str(1 + pdf_reader.getPageNumber(page))
        print(page_number)
        page_content = page.extractText()
        print(page_content)

This code already works. Now I want to do more analysis, such as:

  1. Store each chapter in its own variable and count the number of words. In the end, everything should be stored in an Excel file.
Wasi
Dominik Wg
  • As Niaz Palak explained in his answer, PDFs don't need to contain machine-readable information on chapter structure etc. Some PDFs do, though: tagged PDFs. Are you looking for a generic solution for arbitrary PDFs? Or do you happen to have only tagged PDFs? – mkl Aug 11 '19 at 17:52
  • In general, I am trying to classify and analyze the content of PDF files. Currently I have a Python script which converts the PDF files into text files. Now I am trying to build a feed-forward NN to classify the text. – Dominik Wg Aug 12 '19 at 17:35

1 Answer


I tried something similar with CVs in PDF format. What I learned is the following:

PDF is an unstructured format, so it is not possible to extract information from every PDF in a structured way. But if you know the structure of the book, you can pick out the chapter titles by whatever sets them apart, for example if they are written in bold or italic. This link can help you extract that information. You can then read through a chapter until you hit the next chapter title.
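As a rough sketch of the "read until the next chapter title" idea: the snippet below works on text that has already been extracted from the PDF. It does not use font information (bold/italic); instead it assumes, purely for illustration, that every chapter title sits on its own line and looks like "Chapter 1 ...". The pattern and the helper names (split_into_chapters, word_counts, write_csv) are made up for this example and would need adapting to the actual book. The result is written as CSV, which Excel can open.

```python
import csv
import io
import re
from collections import Counter

# Assumption: chapter titles sit on their own line and look like "Chapter 1 ...".
CHAPTER_HEADING = re.compile(r"^\s*Chapter\s+\d+\b.*$", re.IGNORECASE)
# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD = re.compile(r"[A-Za-z']+")


def split_into_chapters(text):
    """Collect lines under the most recent heading until the next heading."""
    chapters = {}
    current = None  # lines before the first heading are ignored
    for line in text.splitlines():
        if CHAPTER_HEADING.match(line):
            current = line.strip()
            chapters[current] = []
        elif current is not None:
            chapters[current].append(line)
    return {title: "\n".join(lines) for title, lines in chapters.items()}


def word_counts(chapters):
    """Map each chapter title to a Counter of its lower-cased words."""
    return {
        title: Counter(WORD.findall(body.lower()))
        for title, body in chapters.items()
    }


def write_csv(counts, out):
    """Write one row per (chapter, word) pair to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["chapter", "word", "count"])
    for title, counter in counts.items():
        for word, n in counter.most_common():
            writer.writerow([title, word, n])


# Tiny stand-in for the text extracted from the PDF.
sample = """Preface text that is skipped
Chapter 1 Introduction
The cat sat. The cat slept.
Chapter 2 Methods
A dog barked.
"""
chapters = split_into_chapters(sample)
counts = word_counts(chapters)
totals = {title: sum(c.values()) for title, c in counts.items()}
buffer = io.StringIO()  # swap for open("chapters.csv", "w", newline="") to write a file
write_csv(counts, buffer)
print(totals)
```

The words of the heading line itself are not counted, and any body line that happens to start with "Chapter <number>" would be mistaken for a title, so for a real book the heading pattern is the part that needs the most care.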

Niaz Palak