I'm working in R, trying to prepare text documents for analysis. Each document is stored in a column (aptly named "document") of a data frame called "metaDataFrame". The documents are strings containing articles and their BibTeX citation info. The data frame looks like this:

[1] filename         document                          doc_number
[2] lithuania2016    Commentary highlights Estonian...    1
[3] lithuania2016    Norwegian police, immigration ...    2
[4] lithuania2016    Portugal to deply over 1,000 m...    3

I want to extract the BibTeX information from each document into a new column. The citation information begins with "Credit:", but some articles contain multiple "Credit:" instances, so I need to extract all of the text after the last instance. Unfortunately, the string is only sometimes preceded by a new line.

My solution so far has been to find all of the instances of the string and save the location of the last instance of "Credit:" in each document in a list:

locate.last.credit <- lapply(gregexpr('Credit:', metaDataFrame$document), tail, 1)

This provides a list of integer locations of the last "Credit:" string in each document or a value of "-1" where no instance is found. (Those missing values pose a separate but related problem I think I can tackle after resolving this issue).

I've tried variations of strsplit, substr, stri_match_last, and rm_between, but can't figure out a way to use the character position in lieu of a regular expression to extract this part of the string.

How can I use the location of characters to manipulate a string instead of regular expressions? Is there a better approach to this (perhaps with regex)?

1 Answer

How about like this:

test_string <- " Portugal to deply over 1,000 m Credit: mike jones Credit: this is the bibliography"

gsub(".*Credit:\\s*(.*)", "\\1", test_string, ignore.case = TRUE)

[1] "this is the bibliography"

The regex pattern looks for "Credit:", but because it is preceded by the greedy .*, it matches the last instance in the string (if you wanted the first instance of "Credit:", you'd use the lazy .*? instead). \\s* matches zero or more whitespace characters after "Credit:" and before the rest of the text. We then capture the remainder of each document in (.*) as capture group 1 and return it with \\1. I also use ignore.case = TRUE so that credit, CREDIT, and Credit are all matched.
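To make the greedy versus lazy difference concrete, here is a small sketch using the same test string. The variable names last_credit and first_credit are just illustrative, and perl = TRUE is an assumption made here so the lazy quantifier behaves as in PCRE; note that with perl = TRUE the dot does not match newlines unless you add (?s).

```r
test_string <- " Portugal to deply over 1,000 m Credit: mike jones Credit: this is the bibliography"

# Greedy .* consumes as much as possible, so "Credit:" matches its LAST occurrence
last_credit <- sub(".*Credit:\\s*(.*)", "\\1", test_string, perl = TRUE)

# Lazy .*? consumes as little as possible, so "Credit:" matches its FIRST occurrence
first_credit <- sub(".*?Credit:\\s*(.*)", "\\1", test_string, perl = TRUE)

print(last_credit)   # "this is the bibliography"
print(first_credit)  # "mike jones Credit: this is the bibliography"
```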

And with your object it would be:

gsub(".*Credit:\\s*(.*)", "\\1", metaDataFrame$document, ignore.case = TRUE)
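One caveat: when a document contains no "Credit:" at all, the pattern does not match and gsub returns that document unchanged. If you would rather work from character positions, as the question asks, the following is one possible sketch and not the only way to do it. It builds on the gregexpr call from the question and uses substring to cut at the last match; docs, last_pos, and citation are hypothetical names, and docs stands in for metaDataFrame$document.

```r
# Hypothetical sample data standing in for metaDataFrame$document
docs <- c(
  "Intro Credit: mike jones Credit: this is the bibliography",
  "An article with no citation block",
  "Line one\nCredit: multi-line\ncitation"
)

# Start position of every literal "Credit:" match; keep the last one per document
last_pos <- vapply(
  gregexpr("Credit:", docs, fixed = TRUE),
  function(m) as.integer(tail(m, 1)),
  integer(1)
)

# Start just past "Credit:" itself, then trim surrounding whitespace
citation <- trimws(substring(docs, last_pos + nchar("Credit:")))

# gregexpr() reports -1 when nothing matched; mark those documents as missing
citation[last_pos == -1] <- NA_character_

print(citation)
```

The result can then be assigned to a new column, for example metaDataFrame$citation <- citation (the column name is only a suggestion). Documents without a "Credit:" end up as NA rather than as the full article text.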
Mako212