
I want to analyse text from almost 300 PDF documents. I used the pdftools, tm, and tidytext packages to read the text, converted it to a corpus and then to a document-term matrix, and I finally want to structure it in a tidy data frame.

I've got a couple questions:

  • How do I get rid of page data (headers and/or footers at the top and bottom of every PDF page)?
  • I would rather have the filenames as the values in the document column instead of indexed numbers.
  • The following code contains only 2 PDF files for reproducibility. When I run all my files I get 294 documents in my corpus object, but when I tidy it I seem to lose some files, because converted %>% distinct(document) gives back 275. I wonder why that is.

I've got the following reproducible script:

library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)

# Create a temporary empty directory 
# (don't worry at the end of this script I'll remove this directory and its files)

dir.create("~/Desktop/sample-pdfs")

# Fill directory with 2 pdf files from my github repo

download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/'s-Gravenhage_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf")
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/Aa%20en%20Hunze_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")

# Create vector of file paths

dir <- "~/Desktop/sample-pdfs"
pdfs <- paste(dir, "/", list.files(dir, pattern = "*.pdf"), sep = "")

# Read the text from pdf's with pdftools package

pdfs_text <- map(pdfs, pdf_text)

# Convert to document-term-matrix

converted <- Corpus(VectorSource(pdfs_text)) %>%
          DocumentTermMatrix()

# Now I want to convert this to a tidy format

converted %>%
          tidy() %>%
          filter(!grepl("[0-9]+", term))

With the following output:

# A tibble: 5,305 x 3
   document term           count
   <chr>    <chr>          <dbl>
 1 1        aan              158
 2 1        aanbesteding       2
 3 1        aanbestedingen     1
 4 1        aanbevelingen      1
 5 1        aanbieden          3
 6 1        aanbieders         1
 7 1        aanbod             8
 8 1        aandacht          16
 9 1        aandachtspunt      3
10 1        aandeel            1
# ... with 5,295 more rows

This seems to work out nicely, but I would rather have the filenames ("'s-Gravenhage" and "Aa en Hunze") as the values in the document column instead of indexed numbers. How do I do this?

Desired output:

# A tibble: 5,305 x 3
   document      term           count
   <chr>         <chr>          <dbl>
 1 's-Gravenhage aan              158
 2 's-Gravenhage aanbesteding       2
 3 's-Gravenhage aanbestedingen     1
 4 's-Gravenhage aanbevelingen      1
 5 's-Gravenhage aanbieden          3
 6 's-Gravenhage aanbieders         1
 7 's-Gravenhage aanbod             8
 8 's-Gravenhage aandacht          16
 9 's-Gravenhage aandachtspunt      3
10 's-Gravenhage aandeel            1
# ... with 5,295 more rows

Delete the downloaded files and their directory from the desktop by running the following line:

unlink("~/Desktop/sample-pdfs", recursive = TRUE)

All help is much appreciated!

Tdebeus
4 Answers


You can read the documents straight into a corpus with tm; the readPDF reader uses pdftools as its engine, so there is no need to extract the text yourself first and then push it through a corpus to get your output. I created two examples. The first is in line with what you were doing, but builds the corpus directly from the directory. The second is purely based on tidyverse + tidytext, with no need to switch between tm, tidytext, etc.

The difference in the number of tokens between the two examples is due to the automatic cleaning done by tidytext / the tokenizers package.

If you have a lot of documents to process, you might want to use quanteda as your workhorse, since it can work on multiple cores out of the box and might speed up the tokenizer part; a minimal sketch follows the two examples below. Don't forget to use the stopwords package to get a good list of Dutch stopwords. If you need POS tagging for Dutch words, check out the udpipe package.

library(tidyverse)
library(tidytext)
library(tm)

directory <- "D:/sample-pdfs"

# create corpus from pdfs
converted <- VCorpus(DirSource(directory), readerControl = list(reader = readPDF)) %>% 
  DocumentTermMatrix()


converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))

# A tibble: 5,707 x 3
   document                          term           count
   <chr>                             <chr>          <dbl>
 1 's-Gravenhage_coalitieakkoord.pdf "\ade"             4
 2 's-Gravenhage_coalitieakkoord.pdf "\adeze"           1
 3 's-Gravenhage_coalitieakkoord.pdf "\aeen"            2
 4 's-Gravenhage_coalitieakkoord.pdf "\aer"             2
 5 's-Gravenhage_coalitieakkoord.pdf "\aextra"          2
 6 's-Gravenhage_coalitieakkoord.pdf "\agroei"          1
 7 's-Gravenhage_coalitieakkoord.pdf "\ahet"            1
 8 's-Gravenhage_coalitieakkoord.pdf "\amet"            1
 9 's-Gravenhage_coalitieakkoord.pdf "\aonderwijs,"     1
10 's-Gravenhage_coalitieakkoord.pdf "\aop"            11
# ... with 5,697 more rows

Just using tidytext and not tm

directory <- "D:/sample-pdfs"

pdfs <- paste(directory, "/", list.files(directory, pattern = "*.pdf"), sep = "")
pdf_names <- list.files(directory, pattern = "*.pdf")
pdfs_text <- map(pdfs, pdftools::pdf_text)


my_data <- data_frame(document = pdf_names, text = pdfs_text)

my_data %>% 
  unnest %>% # pdfs_text is a list
  unnest_tokens(word, text, strip_numeric = TRUE) %>%  # removing all numbers
  group_by(document, word) %>% 
  summarise(count = n())
# A tibble: 4,646 x 3
# Groups:   document [?]
   document                          word                    count
   <chr>                             <chr>                   <int>
 1 's-Gravenhage_coalitieakkoord.pdf 1e                          2
 2 's-Gravenhage_coalitieakkoord.pdf 2e                          2
 3 's-Gravenhage_coalitieakkoord.pdf 3e                          1
 4 's-Gravenhage_coalitieakkoord.pdf 4e                          1
 5 's-Gravenhage_coalitieakkoord.pdf aan                       164
 6 's-Gravenhage_coalitieakkoord.pdf aanbesteding                2
 7 's-Gravenhage_coalitieakkoord.pdf aanbestedingen              1
 8 's-Gravenhage_coalitieakkoord.pdf aanbestedingsprocedures     1
 9 's-Gravenhage_coalitieakkoord.pdf aanbevelingen               1
10 's-Gravenhage_coalitieakkoord.pdf aanbieden                   4
# ... with 4,636 more rows
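
For completeness, here is a minimal sketch (my addition, not part of the answer above) of the quanteda route mentioned earlier. It reuses the pdfs_text list and pdf_names vector from the second example, and stopwords("nl") comes from the stopwords package:

library(quanteda)
library(stopwords)

# use multiple cores for tokenization
quanteda_options(threads = 4)

# collapse each document's pages into one string and build a quanteda corpus
qcorp <- corpus(map_chr(pdfs_text, paste, collapse = " "), docnames = pdf_names)

# tokenize, drop numbers, punctuation and Dutch stopwords
toks <- tokens(qcorp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  tokens_remove(stopwords("nl"))

# a quanteda dfm can be tidied into the same document / term / count layout as above
tidytext::tidy(dfm(toks))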
phiver
  • Thanks for your answer! Little tip: the great thing about `count()` is that it doesn't need the `group_by()` and `summarise()` functions; it creates the groups itself. So after `unnest_tokens()` I only need `count(document, word)`. – Tdebeus Aug 17 '18 at 09:06

I'd recommend writing a wrapper function for the operations you want to perform; that way you can add each file name as a column.

read_PDF <- function(file){

    pdfs_text <- pdf_text(file)
    converted <- Corpus(VectorSource(pdfs_text)) %>%
          DocumentTermMatrix()
    converted %>%
          tidy() %>%
          filter(!grepl("[0-9]+", term)) %>%

          # add FileName as a column
          mutate(FileName = file)
}

final <- map(pdfs, read_PDF) %>% data.table::rbindlist()
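
A possible variation (my addition, not part of this answer): purrr::map_dfr() binds the rows directly, and basename() plus tools::file_path_sans_ext() strip the stored path down to a bare document name, which is closer to the desired output in the question:

final <- map_dfr(pdfs, read_PDF) %>%
  mutate(FileName = tools::file_path_sans_ext(basename(FileName)))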
Mako212

Nice example!

  • I added a few lines to add names.
  • Not sure about losing files; I didn't get that behaviour.
  • Just a note: your file names are not very standard. I'd recommend checking them again; the first file starts with an apostrophe, and I'd also recommend removing the spaces (see the short sketch after this list).
  • I did my test with English documents; you can set a different language for the corpus.
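
As a quick illustration of that last point (my addition, not part of the original answer), awkward names can be sanitised up front with base R; the vector below simply hard-codes the two sample files:

raw_names   <- c("'s-Gravenhage_coalitieakkoord.pdf", "Aa en Hunze_coalitieakkoord.pdf")

# drop apostrophes and replace spaces with underscores
clean_names <- gsub("\\s+", "_", gsub("'", "", raw_names))
clean_names
# [1] "s-Gravenhage_coalitieakkoord.pdf" "Aa_en_Hunze_coalitieakkoord.pdf"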

Here is the code:

library(tidyverse)
library(tidytext)
library(pdftools) 
library(tm)
library(broom)

# Set the directory that contains the PDF files

dir <- "PDFs/"
pdfs <- paste0(dir, list.files(dir, pattern = "*.pdf"))
names <- list.files(dir, pattern = "*.pdf")

# create a table of names
namesDocs <- 
    names %>% 
    str_remove(pattern = ".pdf") %>% 
    as.tibble() %>% 
    mutate(ids = as.character(seq_along(names)))

namesDocs
# Read the text from pdf's with pdftools package

pdfs_text <- map(pdfs, pdftools::pdf_text)

# Convert to document-term-matrix
# add cleaning process

converted <-
    Corpus(VectorSource(pdfs_text)) %>%
    DocumentTermMatrix(
        control = list(removeNumbers = TRUE,
                       stopwords = TRUE,
                       removePunctuation = TRUE))

converted
# Now I want to convert this to a tidy format
# add names of documents

mytable <-
  converted %>%
  tidy() %>%
  arrange(desc(count)) %>% 
  left_join(y = namesDocs, by = c("document" = "ids"))

head(mytable)

View(mytable)
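
One possible tweak (my addition): since the documents in the question are Dutch, you can pass tm's built-in Dutch stopword list in the control list instead of relying on the default English one:

converted_nl <-
    Corpus(VectorSource(pdfs_text)) %>%
    DocumentTermMatrix(
        control = list(removeNumbers = TRUE,
                       stopwords = tm::stopwords("dutch"),
                       removePunctuation = TRUE))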

I think the easiest approach I found on the web is from Julien Brun's text mining tutorial.

You need two packages

library("readtext")
library("quanteda")

For this code, name your PDFs as Author_date and place them in a folder in your working directory. For example, I'm placing my PDFs in a PDFs folder.

# Set the path to the PDFs
pdf_path <- "PDFs/"

# List the PDFs 
pdfs <- list.files(path = pdf_path, pattern = 'pdf$',  full.names = TRUE) 

# Import the PDFs into R
spill_texts <- readtext(pdfs, 
                        docvarsfrom = "filenames", 
                        sep = "_", 
                        docvarnames = c("First_author", "Year"))

# Transform the pdfs into a corpus object
spill_corpus  <- corpus(spill_texts)
spill_corpus
# Some stats about the pdfs
tokenInfo <- summary(spill_corpus)
tokenInfo
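
A possible continuation (my addition, not from the linked tutorial): to get from the corpus to the tidy document / term / count format asked for in the question, build a document-feature matrix and tidy it (assuming tidytext is installed):

# tokenize, drop numbers and punctuation, then build a document-feature matrix
spill_dfm <- dfm(tokens(spill_corpus, remove_numbers = TRUE, remove_punct = TRUE))

# tidytext can tidy a quanteda dfm into document / term / count rows
tidytext::tidy(spill_dfm)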
Hammao