
I need help extracting information from a PDF file in R (for example https://arxiv.org/pdf/1701.07008.pdf).

I'm using pdftools, but sometimes pdf_info() doesn't work, and in that case I can't manage to extract the information automatically with pdf_text().

NB: tabulizer didn't work on my PC.
Here is the processing I'm doing (sorry, you need to save the PDF and use your own path):

 library(pdftools)

 # Read the document's metadata dictionary
 info <- pdf_info(paste0(path_folder, "/", pdf_path))

 # Append this document's fields to the running vectors
 title <- c(title, info$keys$Title)
 key <- c(key, info$keys$Keywords)
 auth <- c(auth, info$keys$Author)
 dom <- c(dom, info$keys$Subject)
 metadata <- c(metadata, info$metadata)

Most of the time I would just like to get the title and the abstract.
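What I have in mind is a guard along these lines (just a sketch; I'm assuming pdf_info() throws an error when it fails, which may not cover every failure mode):

 info <- tryCatch(
   pdf_info(paste0(path_folder, "/", pdf_path)),
   error = function(e) NULL  # assumption: failure surfaces as an R error
 )

 if (is.null(info)) {
   # Fall back to the raw text: pdf_text() returns one string per page,
   # so the title and abstract would have to be parsed out of page 1 by hand
   first_page <- pdf_text(paste0(path_folder, "/", pdf_path))[1]
 }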

Jérémy

1 Answer


We will need to make some assumptions about the structure of the PDF we wish to scrape. The code below makes the following assumptions:

  1. The title and abstract are on page 1 (a fair assumption?)
  2. The title is set at height 15.
  3. The abstract sits between the first occurrence of the word "Abstract" and the first occurrence of the word "Introduction".
library(tidyverse)
library(pdftools)

data <- pdf_data("~/Desktop/scrape.pdf")

# Get the first page (assumption 1: title and abstract live here)
page_1 <- data[[1]]

# Get the title; here we assume it is the text of height 15
title <- page_1 %>%
  filter(height == 15) %>%
  pull(text) %>%
  paste0(collapse = " ")

# Get the abstract: the tokens from the first "Abstract." up to the first
# "Introduction" (the - 2 also drops the token just before "Introduction",
# presumably its section number)
abstract_start <- which(page_1$text == "Abstract.")[1]
introduction_start <- which(page_1$text == "Introduction")[1]

abstract <- page_1$text[abstract_start:(introduction_start - 2)] %>%
  paste0(collapse = " ")

You can, of course, work off of this and impose stricter constraints for your scraper.
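For example, here is a sketch of one such refinement that avoids hard-coding the height, assuming (which may not hold for every paper) that the title is set in the largest font on page 1:

# Sketch: take the tallest text on page 1 as the title instead of
# relying on a fixed height of 15 (assumes the title uses the largest font)
title <- page_1 %>%
  filter(height == max(height)) %>%
  pull(text) %>%
  paste0(collapse = " ")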

Sada93