Create Dataframe from pdf to csv based on string

Question

I would like to split the information in a PDF document based on the presence of a colon. A sample is here.

An updated PDF with four pages can be downloaded from this link.

I am attempting the following. After reading the pdf, I am trying to split it by colon.

library(textreadr)
dat <- '~Here is the thing1.pdf' %>%
    textreadr::read_pdf()
dat
Source: local data frame [26 x 3]

   page_id element_id                                     text
1        1          1                       Here is the thing.
2        1          2                                Case ID 1
3        1          3 Exploring Angels: It is a long establish
4        1          4 page when looking at its layout. The poi
5        1          5 distribution of letters, as opposed to u
6        1          6 English. Many desktop publishing package
7        1          7 model text, and a search for 'lorem ipsu
8        1          8 versions have evolved over the years, so
9        1          9                           and the like).
10       1         10 New agency: Lorem Ipsum is simply dummy 
..     ...        ...                                      ...

OR

library(pdftools)
dat <- pdf_text("~Here is the thing1.pdf")
dat1 <- strsplit(dat[[1]], "\n")[[1]]
head(dat1)
[1] "Here is the thing.\r"                                                                                           
[2] "Case ID 1\r"                                                                                                    
[3] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a\r"
[4] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal\r"         
[5] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable\r"      
[6] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default\r"

library(stringr)  # provides str_split() and re-exports %>%
dat2 <- dat1 %>%
  str_split(pattern = "\r")
head(dat2)

[[1]]
[1] "Here is the thing." ""                  

[[2]]
[1] "Case ID 1" ""         

[[3]]
[1] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a"
[2] ""                                                                                                             

[[4]]
[1] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal"
[2] ""                                                                                                    

[[5]]
[1] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable"
[2] ""                                                                                                       

[[6]]
[1] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default"
[2] ""
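
Splitting on "\r" leaves an empty second element in every list entry, as the output above shows. A minimal sketch of one way to avoid that, assuming the page text uses "\r\n" line endings; the `page` string below is a made-up stand-in for one element of the `pdf_text()` result, not the real PDF:

```r
library(stringr)

# Made-up stand-in for one page returned by pdftools::pdf_text().
page <- "Here is the thing.\r\nCase ID 1\r\nExploring Angels: It is a long\r\nestablished fact.\r\n"

# Split on an optional carriage return followed by a newline,
# so each element is a clean line with no trailing "\r".
lines <- str_split(page, "\\r?\\n")[[1]]

# Drop the empty string left after the final line break.
lines <- lines[lines != ""]
lines
```

This gives a plain character vector with one line per element, which is easier to work with than a list of two-element vectors.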

I want to get my data sorted into a table like this:

  Case.ID                             Exploring.Angels                        New.agency New.Factor New.Factor2 Creative.One
1       1 It is a long established fact that a reader  Lorem Ipsum is simply dummy text         ABC         BNM         <NA>
2       2               Various versions have evolved     It has survived not only five         ABC        <NA>          DFZ

No explanation for what `New.Factor`, `New.Factor2`, `Creative.One` are? — InfiniteFlash, Dec 24 '17 at 22:38
The answer to your question is one tailored to your pdf through a parsing script. — InfiniteFlash, Dec 24 '17 at 22:41
Also, can you link me to your pdf? It'd be very helpful to me if you provided a sample pdf. — InfiniteFlash, Dec 24 '17 at 22:43
Here is the link: https://www.dropbox.com/s/gnngz6l4mts19lj/Here%20is%20the%20thing.pdf?dl=0 — S Das, Dec 24 '17 at 23:17
`New.Factor`,`New.Factor2` and `Creative.One` are just information regarding particular `Case ID`. I want to separate the information based on `:` for each `Case ID`. — S Das, Dec 24 '17 at 23:19
This problem has already eaten up a couple hours of time. This is just not my cup of tea, but I will eventually have a solution. Just not soon. — InfiniteFlash, Dec 25 '17 at 00:23

dmi3kno · Accepted Answer · 2017-12-25T02:09:13.150

Here's how I would do it using tidyverse

library(tidyverse)

# read in the file, separate by line, convert to tibble
pdftools::pdf_text("../_xlam/Here is the thing1.pdf") %>%
  str_split("\\r?\\n") %>%   # "\r" made optional so Unix line endings also split
  unlist() %>% tibble(value = .) %>%
# separate cases and mark lines containing colon
  mutate(case = cumsum(str_detect(value, "Case ID")),
         tag_line = str_detect(value, ": ")) %>%
# drop lines with Case ID, separate tag from text, move text into one column, fill the tags
  filter(!str_detect(value, "Case ID")) %>%
  separate(value, into = c("key", "text"), sep = ": ", fill = "right", extra = "merge") %>%
  mutate(text = ifelse(is.na(text), key, text),
         key = ifelse(tag_line, key, NA)) %>%
  fill(key) %>%
# summarize text by concatenation
  group_by(case, key) %>%
  summarise(text = paste(text, collapse = " ")) %>%
# filter away the `Here is the thing` line
  drop_na(key) %>%
# move values to columns
  spread(key = key, value = text)
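
To see what the pipeline does without the PDF at hand, here is a small sketch that runs the same steps on a hand-written character vector. The `lines` vector and its contents are assumptions made up for illustration; they only imitate the layout of the sample document. `.groups = "drop"` is an addition that assumes dplyr 1.0 or later.

```r
library(tidyverse)

# Made-up lines imitating the split pdf_text() output.
lines <- c(
  "Here is the thing.",
  "Case ID 1",
  "Exploring Angels: It is a long established",
  "fact that a reader will be distracted.",
  "New agency: Lorem Ipsum is simply dummy text.",
  "New Factor: ABC",
  "Case ID 2",
  "Exploring Angels: Various versions have evolved.",
  "Creative One: DFZ"
)

result <- tibble(value = lines) %>%
  # number the cases and mark lines that start a new "key: text" pair
  mutate(case = cumsum(str_detect(value, "Case ID")),
         tag_line = str_detect(value, ": ")) %>%
  filter(!str_detect(value, "Case ID")) %>%
  separate(value, into = c("key", "text"), sep = ": ",
           fill = "right", extra = "merge") %>%
  # continuation lines have no key yet: keep their text, inherit the key above
  mutate(text = ifelse(is.na(text), key, text),
         key = ifelse(tag_line, key, NA)) %>%
  fill(key) %>%
  group_by(case, key) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  # the title line before the first case never gets a key, so it is dropped
  drop_na(key) %>%
  spread(key = key, value = text)

result
```

Each case becomes one row, and a key that does not occur in a case shows up as `NA`. Since the title mentions csv, the result could then be saved with something like `write_csv(result, "cases.csv")`, where the file name is only a placeholder.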

edited Dec 25 '17 at 02:09

answered Dec 25 '17 at 00:53

dmi3kno


Thanks for the excellent solution. For a pdf with multiple pages, what do I need to use in place of `.[[1]]`? – S Das Dec 25 '17 at 01:48
This is much better than the one I had prepared. Thanks a bunch. Lots to learn here for me. – InfiniteFlash Dec 25 '17 at 01:52
Oh, you could continue to operate on lists until you get to a data frame. So you would do `map(as_tibble) %>% imap_dfr(~mutate(.x, page=.y)) %>%` and continue piping. Post a couple of pages to dropbox and I will amend the script – dmi3kno Dec 25 '17 at 01:54
Thanks a lot. Here is an updated pdf with four pages. https://www.dropbox.com/s/zy9cctdxg5zhnd6/Here%20is%20the%20thing1.pdf?dl=0 – S Das Dec 25 '17 at 02:03
ha! it still works. Just replace `.[[1]]` with `unlist()` – dmi3kno Dec 25 '17 at 02:08
The thing is, @SDas, that the page number does not carry any information (you have paragraphs that flow across pages). Otherwise you might have done as I suggested: record the page number as a column in a tibble and only then rbind them – dmi3kno Dec 25 '17 at 02:16
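
For documents where the page number does matter, the idea from the comments of recording the page as a column before binding can be sketched roughly as follows. The two-element `pages` vector is a made-up stand-in for the `pdf_text()` result, and `paged` is just an illustrative name:

```r
library(tidyverse)

# Made-up two-page stand-in for pdftools::pdf_text() output.
pages <- c("Case ID 1\r\nNew Factor: ABC\r\n",
           "Case ID 2\r\nCreative One: DFZ\r\n")

paged <- pages %>%
  str_split("\\r?\\n") %>%                # one character vector per page
  map(~ tibble(value = .x)) %>%           # one tibble per page
  imap_dfr(~ mutate(.x, page = .y)) %>%   # .y is the position of the page in the list
  filter(value != "")                     # drop the empty line after the last break

paged
```

From here the `mutate()` / `separate()` steps of the answer could continue on `paged`, with `page` carried along as an extra column.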
