I am using the pdftotext Python package to extract text from PDF files. However, I need to remove the headers and footers from the extracted text so that only the body content remains.
There seem to be two ways to solve this:
- Using regular expressions on the extracted text file
- Applying some filter while extracting the text from the PDF
The problem is that the headers and footers are not consistent across pages.
For example, the first one or two lines of the header contain the contractor's address, which is consistent, but the third line of the header holds the section and topic that the page belongs to. Similarly, the footer consists of a project number (not a fixed value either), a subsection number, and some design words followed by a date, which should be consistent within a project but differs between projects. Note also that the PDF file can run to 500+ pages per project, although it will probably be split by section.
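Because the header and footer content varies but their position on the page does not, one possible sketch of the text-side approach is to drop a fixed number of non-blank lines from the top and bottom of each page. The function name `strip_page_edges` and the defaults of three header lines and one footer line are assumptions for illustration; the real counts would have to be checked against the actual specification files.

```python
def strip_page_edges(page_text, header_lines=3, footer_lines=1):
    """Drop the first `header_lines` and last `footer_lines` non-blank lines of a page."""
    lines = page_text.splitlines()
    # Indices of the lines that contain visible text
    non_blank = [i for i, line in enumerate(lines) if line.strip()]
    if len(non_blank) <= header_lines + footer_lines:
        return ""  # the page holds nothing but header and footer
    start = non_blank[header_lines]
    end = non_blank[len(non_blank) - footer_lines - 1]
    return "\n".join(lines[start:end + 1])
```

This only works if every page really has the same number of header and footer lines; pages such as a cover sheet or table of contents would need separate handling.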
Currently I'm using the code below to extract the text. Are there any parameters I don't know about that can be used to remove headers and footers?
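import pdftotext

def get_data(pdf_path):
    with open(pdf_path, "rb") as f:
        # The PDF object behaves like a sequence of page strings
        pdf = pdftotext.PDF(f)
        print("Pages:", len(pdf))
        # Write all pages to one text file, separated by blank lines
        with open("text-pdftotext.txt", "w") as k:
            k.write("\n\n".join(pdf))
    # No explicit close() calls: the with blocks close both files

get_data("specification_file.pdf")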
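As far as I can tell, `pdftotext.PDF` has no option for filtering headers or footers, so the sketch below assumes the cleanup has to happen on the extracted page strings. It counts lines that recur near the top or bottom of most pages, after replacing digit runs with `#` so that page numbers and dates do not prevent a match, and then removes them. All function names and the `edge` and `min_share` defaults are hypothetical choices, not part of any library.

```python
import re
from collections import Counter

def normalize(line):
    # Replace digit runs so page numbers and dates compare as equal
    return re.sub(r"\d+", "#", line.strip())

def find_repeated_edge_lines(pages, edge=3, min_share=0.6):
    """Return normalized lines found in the top/bottom `edge` lines of most pages."""
    counts = Counter()
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        # A set, so a line is counted at most once per page
        counts.update({normalize(l) for l in lines[:edge] + lines[-edge:]})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}

def remove_repeated_lines(pages, repeated):
    """Drop every line whose normalized form is in `repeated`."""
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if normalize(l) not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

The pages could be passed in as `list(pdf)` from the code above. Two limitations are worth noting: the header line carrying the section and topic changes from page to page, so this approach will not catch it, and a matching line is removed wherever it appears on a page, not only at the edges.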