I have multiple PDF files, all in the same format, from which I need to extract the author names. The names should come only from the first page of each PDF; all other pages should be ignored.
Here is the link to the PDF file.
Below is an image of what the first page of the PDF looks like.
I need to extract the author names, which are shown in bold. I am using the code below:
import PyPDF2
import re

file = 'pdf_file'  # path to the PDF file
reader = PyPDF2.PdfReader(file)
page = reader.pages[0]  # first page only
pdf_text_from_paper = page.extract_text()
# Match the text between curly braces (this is what picks up the emails)
emails_pattern = r"\{([^}]+)\}"
email_matches = re.findall(emails_pattern, pdf_text_from_paper)
I am able to extract the emails but not the names. Can anyone tell me how to extract the names?
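Since the names are in bold, one possible starting point is to filter the extracted text by font. The sketch below is only that, a sketch under assumptions: it assumes a recent PyPDF2 release in which `extract_text` accepts a `visitor_text` callback, and it assumes the bold text uses a font whose `/BaseFont` name contains "Bold", which may not be true for these PDFs. The helper names `is_bold`, `bold_text_from_page`, and `bold_text_from_first_page` are made up for illustration.

```python
def is_bold(font_dict):
    """Heuristic (assumption): a font is bold if its /BaseFont name contains 'bold'."""
    if not font_dict:
        return False
    base_font = str(font_dict.get("/BaseFont", ""))
    return "bold" in base_font.lower()


def bold_text_from_page(page):
    """Collect the non-empty text fragments on a page that use a bold font."""
    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # PyPDF2 calls this for each text fragment it extracts from the page.
        if text.strip() and is_bold(font_dict):
            fragments.append(text.strip())

    page.extract_text(visitor_text=visitor)
    return fragments


def bold_text_from_first_page(path):
    """Hypothetical convenience wrapper: bold fragments from page 1 of a PDF."""
    import PyPDF2  # imported here so the helpers above stay dependency-free

    reader = PyPDF2.PdfReader(path)
    return bold_text_from_page(reader.pages[0])
```

Calling `bold_text_from_first_page('pdf_file')` would return a list of bold fragments. Other bold text on the page, such as the title or section headings, would probably be included too, so the result would likely still need filtering to isolate the names.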