Extract author names in the PDF using Python

Question

I have multiple PDF files, all in the same format, from which I need to extract the author names. I only need the names from the first page of each PDF file; all other pages can be ignored.

Here is the link to the PDF file.

The image below shows what the first page of the PDF looks like.

I need to extract the author names, which are shown in bold. I am using the code below to extract the text:

import PyPDF2
import re
file = 'pdf_file'
reader = PyPDF2.PdfReader(file)
page = reader.pages[0]
pdf_text_from_paper = page.extract_text()
emails_pattern  = r"\{([^}]+)\}"
email_matches = re.findall(emails_pattern, pdf_text_from_paper)

I was able to extract the emails but not the names. Can anyone tell me how to extract the names?

I am collecting these author names so that I can use these for ML training — merkle, Mar 26 '23 at 06:21
PyPDF2 is deprecated. Use pypdf (I'm the maintainer of both) — Martin Thoma, Mar 26 '23 at 08:14

Alexander · Accepted Answer · 2023-03-26T14:54:36.060

1

I am not certain that this will work for all of your PDFs, but it at least works for the one you linked to in your question, and if they are all in the same format then it could work on the others as well:

import re
pattern = re.compile(r'\s{4}(?!Introduction)(\w+\s\w*?\.?\s?\w*?)\s{2}')
# findall() needs the extracted text (a str), not the page object
matches = pattern.findall(pdf_text_from_paper)
print(matches)

output

['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
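
To see what the (?!Introduction) part of the pattern does: it is a negative lookahead, so a match is rejected wherever the word "Introduction" comes next. Here is a minimal sketch; the sample string and the simplified patterns are made up for illustration and are not the real text extracted from the PDF.

```python
import re

# Made-up text that mimics the layout: four spaces before each candidate
# and two spaces after it. This is NOT the real extracted page text.
sample = "    Jiaheng Xie  \n    Introduction text  "

# Simplified stand-ins for the pattern above, without and with the lookahead.
without_lookahead = re.findall(r"\s{4}(\w+ \w+)\s{2}", sample)
with_lookahead = re.findall(r"\s{4}(?!Introduction)(\w+ \w+)\s{2}", sample)

print(without_lookahead)  # ['Jiaheng Xie', 'Introduction text']
print(with_lookahead)     # ['Jiaheng Xie']
```

Without the lookahead, any two-word heading with the same spacing is picked up as if it were a name.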

EDIT

This pattern works on both pdfs you linked to.

import re
import PyPDF2

pattern = re.compile(r'1\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n|\{\w+?.*?@.*?\}\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n')
for doc in ["document1.pdf", "document2.pdf"]:
    reader = PyPDF2.PdfReader(doc)
    page = reader.pages[0]
    text = page.extract_text()
    matches = pattern.findall(text)
    # each match is a tuple with one slot per group; keep the non-empty slot
    print([j for i in matches for j in i if j])

OUTPUT:

['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
['Honglin Deng', 'Weiquan W ang', 'Siyuan Li', 'Kai H. Lim']
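
The list comprehension in the last line is needed because the pattern has two alternatives, each with its own capture group. In that case findall() returns one tuple per match, with an empty string for the group that did not take part. A toy sketch of that behaviour (the pattern and input here are made up and much simpler than the real ones):

```python
import re

# Toy pattern: two alternatives, each with its own capture group.
toy = re.compile(r"a(\d)|b(\d)")

matches = toy.findall("a1 b2")
print(matches)  # [('1', ''), ('', '2')]

# Flatten the tuples and drop the empty placeholders.
flat = [group for match in matches for group in match if group]
print(flat)  # ['1', '2']
```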

edited Mar 26 '23 at 14:54

answered Mar 26 '23 at 07:17

Alexander


I'm certain that this will only work for some PDFs :-) – Martin Thoma Mar 26 '23 at 08:11
The same code is not working for this pdf: https://purple-katuscha-96.tiiny.site – merkle Mar 26 '23 at 09:59
How is the word "Introduction" used here? Does this word appear in the PDF? – merkle Mar 26 '23 at 14:05
Yes... It also matched the pattern, so I excluded it from the list of matches. – Alexander Mar 26 '23 at 14:29
@merkle Did you see the edit? – Alexander Mar 26 '23 at 22:09

Martin Thoma · Answer 2 · 2023-03-26T10:03:58.557

0

PDF files have a metadata field. It might not be filled, but maybe you're lucky sometimes:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(reader.metadata.author)

See the pypdf docs on metadata
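
A minimal sketch of a more defensive version, since reader.metadata can be None and the author field can be empty. The helper names (split_author_field, authors_from_metadata) and the list of separators are my assumptions for illustration; the PDF specification does not define how several authors are stored in that single field.

```python
import re
from typing import List, Optional


def split_author_field(raw: Optional[str]) -> List[str]:
    """Split a metadata author string into individual names.

    The separators (';', ',', '&', the word 'and') are an assumption;
    this will mangle entries written as "Last, First".
    """
    if not raw:
        return []
    parts = re.split(r"\s*(?:;|,|\band\b|&)\s*", raw)
    return [part.strip() for part in parts if part.strip()]


def authors_from_metadata(path: str) -> List[str]:
    """Return author names from the PDF metadata, or [] if none are set."""
    # Imported here so split_author_field() can be used without pypdf installed.
    from pypdf import PdfReader

    metadata = PdfReader(path).metadata  # None if the PDF has no info dictionary
    return split_author_field(metadata.author if metadata is not None else None)


print(split_author_field("Jiaheng Xie; Xiao Liu and Xiao Fang"))
# ['Jiaheng Xie', 'Xiao Liu', 'Xiao Fang']
```

An empty list here only means the creator did not fill in the field, not that the PDF has no authors.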

If the authors' names are not set in the metadata field, you need machine learning to make this work in the more general case. The relevant sub-field is called Natural Language Processing (NLP). You would start by extracting the text:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""

# potentially you only want the first page, but it's
# not guaranteed that the authors will always be on
# the first page.
for page in reader.pages:
    text += page.extract_text() + "\n"

Then you can ignore the PDF parts. It's only about recognizing the right parts of the text. That is a different question, e.g.:

edited Mar 26 '23 at 10:03

answered Mar 26 '23 at 08:14

Martin Thoma


This is not working Martin – merkle Mar 26 '23 at 08:56
It might have an empty result for some PDFs (if the PDF creator didn't set that metadata field), but I can assure you it is working in general. – Martin Thoma Mar 26 '23 at 09:39
The PDF creator hasn't set the metadata field. Is there any way to get the author names? – merkle Mar 26 '23 at 09:53
Just by parsing the content then. And that is very unlikely to work well with simple regexes. That is rather a machine learning task. I'll expand my answer a little bit. – Martin Thoma Mar 26 '23 at 10:00
Here is the link for pdf: https://purple-katuscha-96.tiiny.site – merkle Mar 26 '23 at 10:00

K J · Answer 3 · 2023-03-26T16:52:09.103

When a page has a known layout, the simplest approach is usually to export the plain text from a fixed area (or areas) of the page to a file.

Here I print to the console (without file redirection) to extract the area containing the four names. You could also extract the four names as separate entries, one per line of the file.

The problem then is that there are four names this time, but there could be three, two, or one. Different layouts would therefore need different extraction profiles, which can be run in parallel, with the best run selected automatically.

In this case, exclude lines containing the email symbol @ and lines containing UNI, SCHOOL, etc.

pdftotext -f 1 -l 1 -nopgbrk -layout -margint 190 -marginb 400 "document (Author).pdf" -|findstr /i /v "@ uni school"

If you want the emails as well, keep the lines containing @ (leave "@" out of the filter).

Once you have a clean extraction, it is easier to do a second processing pass with regex or simpler means, as the lines are now split.
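
If you would rather do the filtering step in Python than with findstr, a rough equivalent could look like the sketch below. The function name drop_lines_containing and the sample text are made up for illustration. Unlike findstr /v, it also drops blank lines and strips the surrounding whitespace.

```python
from typing import Iterable, List


def drop_lines_containing(
    text: str, keywords: Iterable[str] = ("@", "uni", "school")
) -> List[str]:
    """Keep the non-blank lines that contain none of the keywords (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if any(keyword in stripped.lower() for keyword in lowered):
            continue  # skip email and affiliation lines
        kept.append(stripped)
    return kept


# Made-up text standing in for the cropped pdftotext output.
sample = """\
   First Author
   Example University
   {first.author@example.com}

   Second Author
   School of Examples
"""

print(drop_lines_containing(sample))  # ['First Author', 'Second Author']
```

As with the findstr version, a short keyword such as "uni" will also drop any name that happens to contain those letters, so the keyword list needs tuning per layout.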
