
I need help extracting information from a PDF file in R (for example https://arxiv.org/pdf/1701.07008.pdf).

I'm using pdftools, but sometimes pdf_info() doesn't work, and in that case I can't manage to extract the information automatically with pdf_text().

NB: tabulizer didn't work on my PC.
Here is the processing I'm doing (sorry, you need to save the PDF and use your own path):

 library(pdftools)

 # Read the document's metadata dictionary
 info <- pdf_info(paste0(path_folder, "/", pdf_path))

 # Append this document's fields to the running vectors
 title <- c(title, info$keys$Title)
 key <- c(key, info$keys$Keywords)
 auth <- c(auth, info$keys$Author)
 dom <- c(dom, info$keys$Subject)
 metadata <- c(metadata, info$metadata)

Most of the time I would just like to get the title and the abstract.
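What I have in mind is a guard along these lines (just a sketch; I'm assuming pdf_info() throws an error when it fails, which may not cover every failure mode):

 info <- tryCatch(
   pdf_info(paste0(path_folder, "/", pdf_path)),
   error = function(e) NULL  # assumption: failure surfaces as an R error
 )

 if (is.null(info)) {
   # Fall back to the raw text: pdf_text() returns one string per page,
   # so the title and abstract would have to be parsed out of page 1 by hand
   first_page <- pdf_text(paste0(path_folder, "/", pdf_path))[1]
 }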

Jérémy

1 Answer


We will need to make some assumptions about the structure of the PDF we wish to scrape. The code below makes the following assumptions:

  1. The title and abstract are on page 1 (a fair assumption?)
  2. The title is set at height 15.
  3. The abstract sits between the first occurrence of the word "Abstract" and the first occurrence of the word "Introduction".
library(tidyverse)
library(pdftools)

data <- pdf_data("~/Desktop/scrape.pdf")

# Get the first page (assumption 1: title and abstract live here)
page_1 <- data[[1]]

# Get the title; here we assume it is the text of height 15
title <- page_1 %>%
  filter(height == 15) %>%
  pull(text) %>%
  paste0(collapse = " ")

# Get the abstract: the tokens from the first "Abstract." up to the first
# "Introduction" (the - 2 also drops the token just before "Introduction",
# presumably its section number)
abstract_start <- which(page_1$text == "Abstract.")[1]
introduction_start <- which(page_1$text == "Introduction")[1]

abstract <- page_1$text[abstract_start:(introduction_start - 2)] %>%
  paste0(collapse = " ")

You can, of course, work off of this and impose stricter constraints for your scraper.
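For example, here is a sketch of one such refinement that avoids hard-coding the height, assuming (which may not hold for every paper) that the title is set in the largest font on page 1:

# Sketch: take the tallest text on page 1 as the title instead of
# relying on a fixed height of 15 (assumes the title uses the largest font)
title <- page_1 %>%
  filter(height == max(height)) %>%
  pull(text) %>%
  paste0(collapse = " ")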

Sada93