I am new to text mining, R and the tidy approach and am looking for kind advice to overcome a hurdle with pre-processing text strings read in from pdf files. The specific problem is with a multiple string replacement over multiple strings.

I have data from 2 sources:

  1. PDF reports: I have used map and pdf_text functions to read a directory of pdf reports into a data frame which creates a tibble with 3 columns: page_string, filename and pagenumber. There are 1,191 entries, and page_string holds a string being one page of pdf text.
  2. CSV file of professional words and replacements: I have used the read_csv function to import this. The resulting data frame has 2 columns with 77 entries: target_vocab (e.g. social worker) and replace_token (e.g. social_worker).

My aim is to amend the current character strings in my main data frame, replacing strings which match the professional words in target_vocab with the associated compound token in replace_token prior to tokenization.

String example - before and after string substitution:

  1. "Social workers and early help staff work with multi-agency partners to produce child in need plans led by the allocated social worker".
  2. "Social_workers and early_help staff work with multi_agency partners to produce CIN plans led by the allocated social_worker".

It is hopefully clear that I want "social workers", "early help", "multi-agency", "child in need" and "social worker" replaced with compound tokens.

My code:

#a bank of pdf reports and "professional_words.csv" in current working directory

library(tidyverse)
library(pdftools)
#> Using poppler version 0.73.0
library(tidytext)
library(stringr)

pdf_filenames <- list.files(pattern = "pdf$")

words_df <- read_csv("professional_words.csv", skip = 1, col_names = c("target_vocab", "replace_token"))

pattern_vector <- words_df$target_vocab
replace_vector <- words_df$replace_token

pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
         mutate(filename = .x, pagenumber = row_number()) %>%
           mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector))) 

The bit that doesn't work within the map function is:

mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))

I have tried all sorts of variations, including gsub, breaking it away from the pipe to a separate map function etc. but with my limited knowledge I am not fixing it.

I have consistently had the warning:

In stri_replace_all_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length

With this variation of code I am also getting the error:

Problem with mutate() input page_string. x Input page_string can't be recycled to size 10. ℹ Input page_string is str_replace_all(page_string, pattern = pattern_vector, replacement = replace_vector). ℹ Input page_string must be size 10 or 1, not 77.

My sense is that map or list functions will help me but I seem to be going round in circles and I haven't yet found a Stack Overflow response that has helped me fix the problem.

3 Answers

4

There is a way to do what you want with str_replace_all from stringr. Instead of providing a pattern and a replacement, pass a named vector to pattern, where the names are the targets and the values are the replacements. Something like pattern = c("social worker" = "social_worker", "early help" = "early_help", "multi agency" = "multi_agency"). I'll start with a simple example, and then show you how to have R build that named vector from your words_df.

# Simple example
library(stringr)
string <- "The quick brown fox"
str_replace_all(string, pattern = c("brown" = "green", "fox" = "badger"))
[1] "The quick green badger"

Here is how you do it with some fake data that looks like yours, having R build the named replacement vector.

# Making the fake data
words_df <- data.frame(target = c("fox", "brown", "quick"),
                       replacement = c("badger", "green", "versatile"))

strings_df <- data.frame(page_string = c("The quick brown fox",
                                         "The sad yellow fox",
                                         "The quick old dog",
                                         "The lazy brown dog",
                                         "The quick happy fox"))

# Making the named replacement vector from words_df
replacements <- c(words_df$replacement)
names(replacements) <- c(words_df$target)

# Doing the replacement
library(dplyr)
strings_df %>% 
  mutate(new_string = str_replace_all(page_string, 
                                      pattern = replacements))

# The output
          page_string                 new_string
1 The quick brown fox The versatile green badger
2  The sad yellow fox      The sad yellow badger
3   The quick old dog      The versatile old dog
4  The lazy brown dog         The lazy green dog
5 The quick happy fox The versatile happy badger
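
To make the structure of the named vector more visible, here is a minimal base R sketch. It rebuilds the same fake words_df and uses setNames() as a one-step alternative to assigning names() afterwards; that choice is mine, not a requirement. The targets are stored as the names of the vector and the replacements as its values, and both can be inspected separately.

```r
# Same fake lookup table as above (stringsAsFactors = FALSE keeps the
# columns as character vectors on older R versions too)
words_df <- data.frame(target = c("fox", "brown", "quick"),
                       replacement = c("badger", "green", "versatile"),
                       stringsAsFactors = FALSE)

# Values are the replacements, names are the targets
replacements <- setNames(words_df$replacement, words_df$target)

names(replacements)    # the targets that str_replace_all would search for
#> [1] "fox"   "brown" "quick"
unname(replacements)   # the strings that would be substituted in
#> [1] "badger"    "green"     "versatile"
replacements[["fox"]]  # look up one replacement by its target
#> [1] "badger"
```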
Ben Norris
  • Enormously helpful. For my learning, can you tell me how names() works? When I inspect the named vector replacements, I simply see the replacement words and the target words are not visible. It works though! – Charlotte Waits Aug 25 '20 at 22:52
  • If you use `names(object)` it will return the names of the object (e.g. column names in a data.frame). If you use `names(object) <- ` it will set the names of the object to whatever you assign. – Ben Norris Aug 25 '20 at 22:59
1

str_replace_all does not work like that. If you provide vectors for pattern and replacement, the first pattern/replacement is applied to the first element of string and so on. See the following example:

library(stringr)

fruits <- c("one apple two", "two pears", "three bananas")
pattern_v <- c("one", "two", "three")
replace_v <- c("1", "2", "3")
str_replace_all(fruits, pattern_v, replace_v)
#> [1] "1 apple two" "2 pears"     "3 bananas"

Created on 2020-08-25 by the reprex package (v0.3.0)
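
As a rough illustration of that element-wise pairing, the same result can be reproduced in base R with mapply(), which walks through the three vectors in parallel. This is only a sketch of the behaviour, not how stringr is implemented, and it uses fixed = TRUE so the patterns are treated as literal text rather than regular expressions.

```r
fruits <- c("one apple two", "two pears", "three bananas")
pattern_v <- c("one", "two", "three")
replace_v <- c("1", "2", "3")

# The i-th pattern and replacement are applied only to the i-th string
pairwise <- mapply(
  function(p, r, s) gsub(p, r, s, fixed = TRUE),
  pattern_v, replace_v, fruits,
  USE.NAMES = FALSE
)
pairwise
#> [1] "1 apple two" "2 pears"     "3 bananas"
```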

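Note that "two" is only replaced with "2" in the second element of string, because each pattern is paired with the string at the same position. For the same reason, you get a recycling warning if the length of the pattern/replacement vectors does not match (or evenly divide) the length of string: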

pattern_v <- c("one", "two")
replace_v <- c("1", "2")
str_replace_all(fruits, pattern_v, replace_v)
[1] "1 apple two"   "2 pears"       "three bananas"
Warning message:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement),  :
  longer object length is not a multiple of shorter object length

To circumvent this problem, you can pass a named vector for pattern:

str_replace_all(fruits, c("one" = "1", "two" = "2", "three" = "3"))
[1] "1 apple 2" "2 pears"   "3 bananas"
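
For comparison, here is a base R sketch of the same idea: loop over the name/value pairs and apply each one to every string in turn. The helper name replace_all_pairs is hypothetical, introduced only for this illustration. One difference to keep in mind is that fixed = TRUE matches the targets literally, whereas str_replace_all interprets them as regular expressions.

```r
fruits <- c("one apple two", "two pears", "three bananas")
lookup <- c("one" = "1", "two" = "2", "three" = "3")

# Hypothetical helper: apply every target/replacement pair to all strings
replace_all_pairs <- function(strings, lookup) {
  for (i in seq_along(lookup)) {
    strings <- gsub(names(lookup)[i], lookup[[i]], strings, fixed = TRUE)
  }
  strings
}

replace_all_pairs(fruits, lookup)
#> [1] "1 apple 2" "2 pears"   "3 bananas"
```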

Ben's answer shows a convenient way to build the named vector: the values are the replacements and the names are the patterns to search for.

pattern_new <- c("1", "2", "3")
names(pattern_new) <- c("one", "two", "three")
str_replace_all(fruits, pattern_new)
[1] "1 apple 2" "2 pears"   "3 bananas"
starja
  • I am not sure how names() works but it does and that is so helpful. I will post my finished code for others that are struggling. – Charlotte Waits Aug 25 '20 at 22:54
0

Problem solved thanks to the speedy responses. Here is the working code that resolved my question, for anyone who runs into the same problem in the future:

# Named vector: values are the compound tokens, names are the target phrases
professional_terms <- c(words_df$replace_token)
names(professional_terms) <- c(words_df$target_vocab)

pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
  mutate(filename = .x,
         pagenumber = row_number(),
         page_string = str_replace_all(page_string, pattern = professional_terms)))