converting plain text to data frame using dplyr in r

Question

I'm trying to use r convert plain text scraped from a pdf with pdftools and tidyverse into a data frame. I'm hoping for a solution using tidyverse packages. I've used the following code to get to a list of strings with my essential information:

library(tidyverse)
library(pdftools)

textdf <- pdf_text("raw pdf.pdf")

all_stats_lines <- textdf[3:28]%>%
  str_squish()%>%
  str_replace_all(",", "")%>%
  str_remove_all("\\+80% \\+80% \\+80% \\+40% \\+40% \\+40% Baseline Baseline Baseline \\-40% \\-40% 
  \\-40% \\-80% \\-80% \\-80% Sun Feb 16 Sun Mar 8 Sun Mar 29 Sun Feb 16 Sun Mar 8 Sun Mar 29 Sun Feb 
  16 Sun Mar 8 Sun Mar 29")%>% 
  str_remove_all("compared to baseline")%>%
  strsplit("   ")

This yields the following list of 26 lists of strings in the following format:


[[1]]
[1] "Alaska Variable 1 Variable 2 Variable 3 42  15  5" 
[2] "Variable 4 Variable 5 Variable 6 43  30  11"              
[3] "Alabama Variable 1 Variable 2 Variable 3 27  9  79"
[4] "Variable 4  Variable 5 Variable 6 20  23  4  "          

[[2]]
[1] "Arizona Variable 1 Variable 2 Variable 3 40  17  7" 
[2] "Variable 4 Variable 5 Variable 6 41  33  10"              
[3] "Arkansas Variable 1 Variable 2 Variable 3 29  7  81"
[4] "Variable 4  Variable 5 Variable 6 22  27  7  "   

... etc.

Note the state names at the beginning of sub list rows 1 and 3 and spaces within variable names. Each state should be one row. There should be 6 columns Variable 1 Variable 2 Variable 3 Variable 4 Variable 5 Variable 6 with the corresponding values in order.

Any solution for how to build this table?

score 0 · Accepted Answer · answered Apr 05 '20 at 22:45

You have several problems you need to solve in order to do this.

splitting text to columns
combining pairs of rows
combining lists

If you are only going to do this once, it would probably be simpler to copy your data into Excel, format it there and then load it into R. But if you are committed to using R, then we need to work through each of these in order:

1) Text to columns

strsplit(string, " ") will split text strings on spaces. strsplit(string, "[[:space:]]+") will split on white space, treating consecutive white space as a single split. However you want to avoid separating "Variable 1" into "Variable" and "1", and separating "North Dakota" into "North" and "Dakota"

A quick solution here is gsub("North ", "North_", string) which will turn "North Dakota" into "North_Dakota" so the two words remain together when you apply strsplit.

2) Combining pairs of rows

You can use modulo to extract every second entry of a vector: 1:4 %% 2 will return c(1,0,1,0). This can be used to extract every second value like so: vec[1:4 %% 2 == 1].

Combining these in pairs gives you something like the following:

vec = c('a', 'b', 'c', 'd')
paste(vec[1:4 %% 2 == 1], vec[1:4 %% 2 == 0])

3) Combining lists

The easiest solution here is probably unlist. But you could also use the reduce function in the purrr package.

Combining it all

data = unlist(data)
data = trimws(data)
nn = length(data)
data = paste(data[1:nn %% 2 == 1], data[1:nn %% 2 == 0])

# add other rules here for state names that are two words
data = gsub("ariable ", "ariable_", data)
data = gsub("North ", "North_", data)

data %>%
  strsplit("[[:space:]]+") %>%
  purrr::reduce(rbind) %>%
  as.data.frame()

Thanks so much for your effort on this excellent response. It solved my problem perfectly and taught me a lot about compiling scraped text. — user1988, Apr 06 '20 at 16:32
You are welcome. When you find an answer helpful, best practice is to upvote or accept it as the correct answer. — Simon.S.A., Apr 06 '20 at 20:39
Just did thanks (my upvote won't count though from lack of reps). — user1988, Apr 07 '20 at 22:10

converting plain text to data frame using dplyr in r

1 Answers1

1) Text to columns

2) Combining pairs of rows

3) Combining lists

Combining it all