I have multiple PDF files, all in the same format, and I need to extract the author names from each one. The names should be taken only from the first page of each PDF; all other pages should be ignored.

Here is the link to the PDF file.

The first page of the PDF looks like this:

[image: first page of the PDF]

I need to extract the author names, which appear in bold. I am using the code below:

import re

import PyPDF2

file = 'pdf_file'
reader = PyPDF2.PdfReader(file)
page = reader.pages[0]  # only the first page is needed
pdf_text_from_paper = page.extract_text()
# Capture whatever sits between a pair of curly braces
emails_pattern = r"\{([^}]+)\}"
email_matches = re.findall(emails_pattern, pdf_text_from_paper)

I was able to extract the emails but not the names. Can anyone tell me how to extract the names?

Hasan Haghniya
merkle

3 Answers

I am not certain that this will work for all of your PDFs, but it works for the one linked in your question. If the others all share the same format, it may work on them as well:

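pattern = re.compile(r'\s{4}(?!Introduction)(\w+\s\w*?\.?\s?\w*?)\s{2}')
# `page` is the first-page object from the question; findall needs its text, not the page object
matches = pattern.findall(page.extract_text())
print(matches)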

OUTPUT:

['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']

EDIT

This pattern works on both PDFs you linked to:

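import re

import PyPDF2

# Two alternatives, each capturing a name: one after a leading "1",
# the other after a braced block that contains an "@".
pattern = re.compile(r'1\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n|\{\w+?.*?@.*?\}\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n')
for doc in ["document1.pdf", "document2.pdf"]:
    reader = PyPDF2.PdfReader(doc)
    page = reader.pages[0]
    text = page.extract_text()
    matches = pattern.findall(text)
    # findall returns one tuple per match; keep only the non-empty group
    print([j for i in matches for j in i if j])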

OUTPUT:

['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
['Honglin Deng', 'Weiquan W ang', 'Siyuan Li', 'Kai H. Lim']
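
The list comprehension in the loop is needed because the pattern has two alternatives, each with its own capture group, so findall returns one tuple per match with an empty string for the group that did not take part. Here is a minimal sketch of that behaviour. The pattern and sample text are made up for illustration and are not the real PDF output:

```python
import re

# Hypothetical pattern with two alternatives, one capture group each.
pattern = re.compile(r"Author: (\w+ \w+)|By (\w+ \w+)")
sample = "Author: Ada Lovelace\nBy Alan Turing\n"

# With several groups, findall returns one tuple per match; the group
# belonging to the alternative that did not match is an empty string.
matches = pattern.findall(sample)

# Flatten the tuples and drop the empty placeholders.
names = [name for groups in matches for name in groups if name]
print(names)
```

Flattening this way keeps the names in the order they appear in the text, which is probably what you want for an author list.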
Alexander

PDF files have a metadata field. It might not be filled, but maybe you're lucky sometimes:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
# metadata is None when the PDF carries no document information
meta = reader.metadata
print(meta.author if meta else None)

See the pypdf docs on metadata
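
When the author field is filled, it is a single string, so several authors still have to be split apart. A rough sketch of that step follows. The helper name split_authors is hypothetical, and the separators (semicolon, comma, the word "and") are an assumption, since nothing forces PDF producers to use any particular one:

```python
import re
from typing import List, Optional


def split_authors(author_field: Optional[str]) -> List[str]:
    """Split a metadata author string into individual names.

    Assumption: names are separated by ';', ',' or the word 'and'.
    """
    if not author_field:
        # Covers both a missing field (None) and an empty string.
        return []
    parts = re.split(r"\s*(?:;|,|\band\b)\s*", author_field)
    return [part.strip() for part in parts if part.strip()]


print(split_authors("Jiaheng Xie; Xiao Liu and Daniel Dajun Zeng"))
```

Splitting on commas would break a "Last, First" style field, so check a few of your own files before relying on it.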

If the authors' names are not set in the metadata field, you need machine learning to make this work in the more general case. The relevant sub-field is called Natural Language Processing (NLP). You would start by extracting the text:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""

# potentially you only want the first page, but it's
# not guaranteed that the authors will always be on
# the first page.
for page in reader.pages:
    text += page.extract_text() + "\n"

From that point on you can ignore the PDF side of things. The task is only about recognizing the right parts of the text, which is a different question.

Martin Thoma

When a page has a known layout, the simplest approach is usually to export the plain text from a fixed area (or areas) to a file.

Here I am using the console (without redirecting to a file) to extract the area that contains the four names. You could instead write the four names as separate entries, one name per line of a file.

The problem is that there are four names this time, but there could be three, two, or one. Different documents would therefore need different extraction profiles, which can be run in parallel, with the best run selected automatically.

In this case, that means excluding lines that contain the email symbol @ and lines that contain UNI, SCHOOL, etc.:

pdftotext -f 1 -l 1 -nopgbrk -layout -margint 190 -marginb 400 "document (Author).pdf" -|findstr /i /v "@ uni school"

If you want the emails as well, keep the @ lines by dropping @ from the exclusion list.

Once you have a clean extraction, it is easier to post-process it with a regex or simpler means, because the names are now on separate lines.
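
If the second pass is done in Python instead of findstr, the same exclusion filter could look roughly like the sketch below. The function name keep_name_lines and the sample text are made up for illustration; unlike findstr, this version also drops blank lines:

```python
from typing import Iterable, List

# Same substrings as in the findstr call; matching is case-insensitive.
EXCLUDE = ("@", "uni", "school")


def keep_name_lines(lines: Iterable[str], exclude=EXCLUDE) -> List[str]:
    """Drop blank lines and lines containing any excluded substring."""
    kept = []
    for line in lines:
        stripped = line.strip()
        lowered = stripped.lower()
        if stripped and not any(token in lowered for token in exclude):
            kept.append(stripped)
    return kept


# Made-up sample standing in for the pdftotext output.
sample = """
    Jiaheng Xie
    University of Somewhere
    someone@example.edu
    Xiao Liu
    School of Something
"""
print(keep_name_lines(sample.splitlines()))
```

As with the findstr version, a plain substring test is blunt: a name such as "Eunice" contains "uni" and would be dropped, so the exclusion list needs tuning per document set.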

K J