I'm working in R, trying to prepare text documents for analysis. Each document is stored in a column (aptly named "document") of a data frame called "metaDataFrame". The documents are strings containing articles and their BibTeX citation info. The data frame looks like this:
[1] filename       document                          doc_number
[2] lithuania2016  Commentary highlights Estonian... 1
[3] lithuania2016  Norwegian police, immigration ... 2
[4] lithuania2016  Portugal to deply over 1,000 m... 3
I want to extract the BibTeX information from each document into a new column. The citation information begins with "Credit:", but some articles contain multiple "Credit:" instances, so I need to extract all of the text after the last instance. Unfortunately, that last "Credit:" is only sometimes preceded by a newline, so I can't rely on line breaks to find it.
My solution so far has been to find all of the instances of the string and save the location of the last instance of "Credit:" in each document in a list:
locate.last.credit <- lapply(gregexpr('Credit:', metaDataFrame$document), tail, 1)
This gives a list with one integer per document: the starting position of the last "Credit:" string, or -1 where no instance is found. (Those missing values pose a separate but related problem that I think I can tackle after resolving this issue.)
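To make that concrete, here is a minimal reproducible sketch on toy data. The `toy` data frame and its contents are made up for illustration (they are not my real articles); only the `gregexpr` call is the one I'm actually using.

```r
# Toy stand-in for metaDataFrame (hypothetical contents, not the real articles)
toy <- data.frame(
  document = c(
    "Body text. Credit: photo desk. More text.\nCredit: @article{a1}",
    "Body text with no citation block.",
    "Intro Credit: @article{b2}"
  ),
  stringsAsFactors = FALSE
)

# Same call as above: start position of the last "Credit:" in each document,
# or -1 when the document contains no "Credit:" at all
locate.last.credit <- lapply(gregexpr("Credit:", toy$document), tail, 1)

# Flatten the list to a plain vector, one position per document
last.pos <- unlist(locate.last.credit)
print(last.pos)
```

In the first toy document the position returned is that of the second "Credit:", which is the one I want to cut at.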
I've tried variations of strsplit, substr, stri_match_last, and rm_between, but I can't figure out how to use the character position, in lieu of a regular expression, to extract this part of the string.
How can I use the location of characters to manipulate a string instead of regular expressions? Is there a better approach to this (perhaps with regex)?
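For reference, here is a rough sketch of the two directions I have in mind, again on toy data. The `docs` vector and the variable names are illustrative assumptions, and I'm not sure either is the idiomatic way to do it. The first uses base R's `substring()` with the positions from `gregexpr()`; the second relies on a greedy `.*` to skip ahead to the last "Credit:".

```r
# Hypothetical toy documents standing in for metaDataFrame$document
docs <- c(
  "Body text. Credit: photo desk. More text.\nCredit: @article{a1}",
  "Body text with no citation block.",
  "Intro Credit: @article{b2}"
)

# Position-based: start of the last "Credit:" per document (-1 if absent)
last.pos <- vapply(gregexpr("Credit:", docs, fixed = TRUE),
                   function(m) as.integer(tail(m, 1)), integer(1))

# substring() is vectorised over start positions; begin just past "Credit:"
# and return NA for documents where nothing was found
citation.pos <- ifelse(last.pos > 0,
                       substring(docs, last.pos + nchar("Credit:")),
                       NA_character_)
citation.pos <- trimws(citation.pos)

# Regex-based: the greedy ".*" consumes everything up to the last "Credit:".
# "(?s)" lets "." match newlines as well, which needs perl = TRUE.
has.credit <- grepl("Credit:", docs, fixed = TRUE)
citation.rx <- ifelse(has.credit,
                      sub("(?s).*Credit:\\s*", "", docs, perl = TRUE),
                      NA_character_)

print(citation.pos)
print(citation.rx)
```

Both versions leave NA where no "Credit:" exists, so the result could be assigned straight to a new column, but I'd like to know whether there is a cleaner approach.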