I want to analyse text from almost 300 PDF documents. I used the pdftools, tm and tidytext packages to read the text, converted it to a corpus, then to a document-term matrix, and finally I want to structure it in a tidy data frame.
I've got a couple of questions:
- How do I get rid of page data (at the top and/or bottom of every PDF page)?
- I would rather have the filenames as the values in the document column instead of index numbers.
- The following code contains only 2 PDF files for reproducibility. When I run it on all my files I get 294 documents in my corpus object, but when I tidy it I seem to lose some files, because converted %>% distinct(document) returns 275. I wonder why that is.
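To narrow down the drop from 294 to 275 documents, one option is to compare the document ids in the tidied result with the positions in the file vector. tidy() on a document-term matrix returns only the non-zero entries, so a document that ends up with no terms at all has no rows. A scanned PDF without a text layer would be one possible cause, but that is a guess. The sketch below uses toy stand-ins: `pdfs` and `tidied` are hypothetical, and in the real script `tidied` would be the result of tidy(converted).

```r
library(dplyr)

# Toy stand-ins (hypothetical): three file paths and a tidied result in
# which document "2" contributed no rows
pdfs <- c("a.pdf", "b.pdf", "c.pdf")
tidied <- tibble(
  document = c("1", "1", "3"),
  term     = c("aan", "aanbod", "aan"),
  count    = c(2, 1, 1)
)

# Ids that should exist if every file produced at least one term
expected <- as.character(seq_along(pdfs))

# Ids that never show up in the tidy data frame
missing_ids <- setdiff(expected, unique(tidied$document))

# Files worth opening by hand to see what pdf_text() extracted from them
missing_files <- pdfs[as.integer(missing_ids)]
print(missing_files)
```

This assumes the document ids are simply the positions in `pdfs`, as in the output shown further down.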
I've got the following reproducible script:
library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)
# Create a temporary empty directory
# (don't worry at the end of this script I'll remove this directory and its files)
dir.create("~/Desktop/sample-pdfs")
# Fill directory with 2 pdf files from my github repo
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/'s-Gravenhage_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf")
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/Aa%20en%20Hunze_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")
# Create vector of file paths
dir <- "~/Desktop/sample-pdfs"
# `pattern` is a regular expression, not a glob, so match the ".pdf" suffix explicitly
pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
# Read the text from pdf's with pdftools package
pdfs_text <- map(pdfs, pdf_text)
# Convert to document-term-matrix
converted <- Corpus(VectorSource(pdfs_text)) %>%
  DocumentTermMatrix()
# Now I want to convert this to a tidy format
converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))
With the following output:
# A tibble: 5,305 x 3
   document term           count
   <chr>    <chr>          <dbl>
 1 1        aan              158
 2 1        aanbesteding       2
 3 1        aanbestedingen     1
 4 1        aanbevelingen      1
 5 1        aanbieden          3
 6 1        aanbieders         1
 7 1        aanbod             8
 8 1        aandacht          16
 9 1        aandachtspunt      3
10 1        aandeel            1
# ... with 5,295 more rows
This seems to work out nicely, but I would rather have the filenames ("'s-Gravenhage" and "Aa en Hunze") as the values in the document column instead of index numbers. How do I do this?
Desired output:
# A tibble: 5,305 x 3
   document      term           count
   <chr>         <chr>          <dbl>
 1 's-Gravenhage aan              158
 2 's-Gravenhage aanbesteding       2
 3 's-Gravenhage aanbestedingen     1
 4 's-Gravenhage aanbevelingen      1
 5 's-Gravenhage aanbieden          3
 6 's-Gravenhage aanbieders         1
 7 's-Gravenhage aanbod             8
 8 's-Gravenhage aandacht          16
 9 's-Gravenhage aandachtspunt      3
10 's-Gravenhage aandeel            1
# ... with 5,295 more rows
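One possible way to get there is to treat the document column as an index into the vector of file paths and swap each id for a cleaned-up filename. This is only a sketch under assumptions: that the ids are the positions "1", "2", ... in `pdfs` (as the current output suggests), and that every file ends in "_coalitieakkoord.pdf". `tidied` is a toy stand-in for the tidied document-term matrix, with made-up counts.

```r
library(dplyr)

# File paths as built in the script above
pdfs <- c("~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf",
          "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")

# Toy stand-in (hypothetical values) with the same columns as the tidied output
tidied <- tibble(
  document = c("1", "1", "2"),
  term     = c("aan", "aanbod", "aan"),
  count    = c(3, 2, 1)
)

# Drop the directory and the shared "_coalitieakkoord.pdf" suffix
doc_names <- sub("_coalitieakkoord\\.pdf$", "", basename(pdfs))

# Replace each positional id by the matching name
named <- tidied %>%
  mutate(document = doc_names[as.integer(document)])

print(named)
```

In the real script the same mutate() step could be appended to the tidy() pipeline, since `pdfs_text` is built from `pdfs` in the same order.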
Delete the downloaded files and their directory from the desktop by running the following line:
unlink("~/Desktop/sample-pdfs", recursive = TRUE)
All help is much appreciated!