I'm trying to use R to convert plain text scraped from a PDF with pdftools into a data frame, and I'm hoping for a solution using tidyverse packages. The following code gets me to a list of strings containing the essential information:
library(tidyverse)
library(pdftools)
# pdf_text() returns one string per page
textdf <- pdf_text("raw pdf.pdf")
all_stats_lines <- textdf[3:28] %>%
  str_squish() %>%
  str_replace_all(",", "") %>%
  # drop the repeated percentage and date labels found on each page
  str_remove_all("\\+80% \\+80% \\+80% \\+40% \\+40% \\+40% Baseline Baseline Baseline \\-40% \\-40% \\-40% \\-80% \\-80% \\-80% Sun Feb 16 Sun Mar 8 Sun Mar 29 Sun Feb 16 Sun Mar 8 Sun Mar 29 Sun Feb 16 Sun Mar 8 Sun Mar 29") %>%
  str_remove_all("compared to baseline") %>%
  strsplit(" ")
This yields a list of 26 character vectors in the following format:
[[1]]
[1] "Alaska Variable 1 Variable 2 Variable 3 42 15 5"
[2] "Variable 4 Variable 5 Variable 6 43 30 11"
[3] "Alabama Variable 1 Variable 2 Variable 3 27 9 79"
[4] "Variable 4 Variable 5 Variable 6 20 23 4 "
[[2]]
[1] "Arizona Variable 1 Variable 2 Variable 3 40 17 7"
[2] "Variable 4 Variable 5 Variable 6 41 33 10"
[3] "Arkansas Variable 1 Variable 2 Variable 3 29 7 81"
[4] "Variable 4 Variable 5 Variable 6 22 27 7 "
... etc.
Note the state names at the beginning of elements 1 and 3 of each sublist, and the spaces within the variable names. Each state should be one row, with six columns (Variable 1, Variable 2, Variable 3, Variable 4, Variable 5, Variable 6) holding the corresponding values in order.
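To illustrate the target, here is roughly what I'd expect for the first two states, typed by hand from the sample above. The column name State is just a placeholder I picked; the values are copied from the sample and are not produced by any parsing code.

```r
library(tibble)

# Desired layout for the first two states.
# Values are copied by hand from the sample output above;
# the `State` column name is a placeholder, not something from the PDF.
desired <- tribble(
  ~State,    ~`Variable 1`, ~`Variable 2`, ~`Variable 3`, ~`Variable 4`, ~`Variable 5`, ~`Variable 6`,
  "Alaska",  42, 15,  5, 43, 30, 11,
  "Alabama", 27,  9, 79, 20, 23,  4
)

desired
```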
How can I build this table?