
I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset() and str_extract_all() to acquire the smaller attributes I need. The main goal of this project is to scrape case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproduced example.

Is there a way I can use those two unique words ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I would like to extract? If not, what sort of approach can I take to extract the narrative data I need from each report?

text_data <- list("ES                     SPRINGFEILD POLICE DE     FARRELL #789\n NOTIFIED                  DATE           TIME               OFFICER\nMARITAL STATUS:       UNKNOWN\nIDENTIFIED BY:    H. POIROT                     AT:   SCENE              DATE:    01/02/1895\nFINGERPRINTS TAKEN BY                         DATE\n YES                      NO                  OBIWAN KENOBI                            01/02/1895\n
              SPRINGFEILD\n CASE#:       012-345-678\n ABC NOTIFIED:                                    ABC DATE:\n ABC OFFICER:                                           NATURE:\nCASE HISTORY\n    This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n     the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR:       HERCULE POIROT             \n")

Here is what the expected output would be.

output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n     the next capitalized word, investigator with a colon, is a unique word where the string stops.")

Thanks so much for helping!

  • Please show the expected output – akrun Mar 01 '21 at 18:13
  • @akrun - edited the post, but here is what I would need: output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.") – Averysaurus Mar 01 '21 at 19:04

2 Answers


One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.

gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n    This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n     the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"

Colin H
  • update: This line manages to capture the first instance of the narrative text: narrative_text <- unlist(lapply(str_split(text_data, keywords), "[[", 2)) Still thinking about how to write a function that iterates over all instances in the dataset. – Averysaurus Mar 01 '21 at 21:05

After much deliberation I came to a solution I feel is worth sharing, so here we go:

# load the packages used below
library(magrittr)  # for the %>% pipe
library(readr)
library(stringr)
library(purrr)

# collapse the list of documents into a single string
file_contents_unlist <-
  paste(unlist(text_data), collapse = " ")

# split into lines, squish stray whitespace for good measure
file_contents_lines <-
  file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()

# Create indices into the lines of our text data with grepl() and
# regexes; be sure the start and end patterns pair up if scraping
# multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)",
                                file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)",
                                file_contents_lines))

# the function basically states, "give me back whatever's between
# those two indices"
pull_case_num <-
  function(index_case_num_1, index_case_num_2) {
    file_contents_lines[index_case_num_1:index_case_num_2]
  }

# map2() to iterate over the paired start/end indices
case_nums <- map2(index_case_num_1,
                  index_case_num_2,
                  pull_case_num)

# transform to a data frame
case_nums_df <- as.data.frame.character(case_nums)

# repeat the pattern for other vectors as needed
index_case_hist_1 <-
  which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <-
  which(grepl("Case#: ", file_contents_lines))

pull_case_hist <-
  function(index_case_hist_1, index_case_hist_2) {
    file_contents_lines[index_case_hist_1:index_case_hist_2]
  }

case_hist <- map2(index_case_hist_1,
                  index_case_hist_2,
                  pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)

# cbind() the two data frames; also a good place to debug from
cases_comp <- cbind(case_nums_df, case_hist_df)
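
With the sample text_data above, cases_comp should hold one row pairing the case-number line with its narrative chunk; a quick sanity check of the result might look like:

# inspect the structure and first row of the combined result
str(cases_comp)
cases_comp[1, ]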

Thanks all for responding. I hope this solution helps someone out there in the future. :)