I am new to text mining, R and the tidy approach and am looking for kind advice to overcome a hurdle with pre-processing text strings read in from pdf files. The specific problem is with a multiple string replacement over multiple strings.
I have data from 2 sources:
- PDF reports: I have used map and pdf_text functions to read a directory of pdf reports into a data frame which creates a tibble with 3 columns: page_string, filename and pagenumber. There are 1,191 entries, and page_string holds a string being one page of pdf text.
- CSV file of professional words and replacements: I have used the read_CSV function to import this. The resultant df has 2 columns with 77 entries: target_vocab (e.g. social worker) and replace_token (e.g. social_worker).
My aim is to amend the current character strings in my main data frame, replacing strings which match the professional words in target_vocab with the associated compound token in replace_token prior to tokenization.
String example - before and after string substitution:
- "Social workers and early help staff work with multi-agency partners to produce child in need plans led by the allocated social worker".
- "Social_workers and early_help staff work with multi_agency partners to produce CIN plans led by the allocated social_worker".
It is hopefully clear that I want "social workers", "early help", "multi-agency", "child in need" and "social worker" replaced with compound tokens.
My code:
#a bank of pdf reports and "professional_words.csv" in current working directory
library(tidyverse)
library(pdftools)
#> Using poppler version 0.73.0
library(tidytext)
library(stringr)
pdf_filenames <- list.files(pattern = "pdf$")
words_df <- read_csv("professional_words.csv", skip = 1, col_names = c("target_vocab", "replace_token"))
pattern_vector <- words_df$target_vocab
replacement_vector <- words_df$replace_token
pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
mutate(filename = .x, pagenumber = row_number()) %>%
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
The bit that doesn't work within the map function is:
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
I have tried all sorts of variations, including gsub, breaking it away from the pipe to a separate map function etc. but with my limited knowledge I am not fixing it.
I have consistently had the warning:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length
With this variation of code I am also getting the error:
Problem with
mutate()
inputpage_string
. x Inputpage_string
can't be recycled to size 10. ℹ Inputpage_string
isstr_replace_all(page_string, pattern = pattern_vector, replacement = replace_vector)
. ℹ Inputpage_string
must be size 10 or 1, not 77.
My sense is that map or list functions will help me but I seem to be going round in circles and I haven't yet found a Stack Overflow response that has helped me fix the problem.