
I'm looking for a way to use wildcard characters as part of the removal criteria for a section of a corpus. I couldn't find anything on SO or Google related to this.

Purpose: Analyzing a large dataset of standardized notes in which employee input is broken into labeled sections of the text.

Example data:

***Date; Area: asdfwerqw Detail: xxxxx Requested Action: xxxxxx Assigned to: John Doe

Portion to extract for analysis:

Detail: xxxxx Requested Action: xxxxxx

There may be more items before "Detail:". Also, "Assigned to:" may not appear.

Community

1 Answer


It's hard to tell without more examples and details, but you're probably going to want to use regular expressions with positive lookahead and optional items:

library(stringr)

text <- c("***Date; Area: asdfwerqw Detail: xxxxx Requested Action: xxxxxx Assigned to: John Doe")

str_extract_all(text, c("Detail(.*?)(?=Requested Action:)", "Requested Action:((.*?)(?=Assigned to:))?"))

# [[1]]
# [1] "Detail: xxxxx "
# 
# [[2]]
# [1] "Requested Action: xxxxxx "
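
One caveat with the optional group: if "Assigned to:" is missing, the lookahead inside the group can never succeed, so the second pattern should match only the bare "Requested Action:" label. A possible workaround, sketched here in base R rather than stringr, is to let the lookahead accept either "Assigned to:" or the end of the string. The second note in `notes` is made-up data modeled on the question's example, and the names `notes`, `pattern`, and `extracted` are only illustrative.

```r
# Sample notes. The second one has no "Assigned to:" section
# (hypothetical data modeled on the example in the question).
notes <- c(
  "***Date; Area: asdfwerqw Detail: xxxxx Requested Action: xxxxxx Assigned to: John Doe",
  "***Date; Area: asdfwerqw Detail: yyyyy Requested Action: yyyyyy"
)

# Lazily match from "Detail:" up to (but not including) either
# "Assigned to:" or the end of the string, whichever comes first.
# The \\s* inside the lookahead keeps trailing whitespace out of the match.
pattern <- "Detail:.*?(?=\\s*Assigned to:|\\s*$)"

# regexpr() finds the first match in each string; perl = TRUE is needed
# for lookahead support.
m <- regexpr(pattern, notes, perl = TRUE)

# Keep the result aligned with the input: NA where nothing matched.
extracted <- rep(NA_character_, length(notes))
extracted[m != -1] <- regmatches(notes, m)

extracted
# [1] "Detail: xxxxx Requested Action: xxxxxx"
# [2] "Detail: yyyyy Requested Action: yyyyyy"
```

Because the pattern starts at "Detail:", any number of items before it is simply skipped. This assumes each note is a single line and that "Detail:" appears at most once per note.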
JasonAizkalns