The input data frame has three id columns and one text column (text). u_id identifies the user, doc_id identifies a document belonging to that user, and sent_id identifies a sentence within that document.
df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),
doc_id=c(1,1,1,2,2,1,1,2),
sent_id=c(1,2,3,1,2,1,2,1),
text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
"pertinent results: 2105-4-16 05:02pm gap-14
2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
"method exists and the former because calls to the corresponding",
"admission date: 2001-4-19 discharge date: 2002-5-23 service:",
"pertinent results: 2105-4-16 05:02pm gap-14
2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
"method exists and the former because calls to the corresponding",
"method exists and the former because calls to the corresponding",
"method exists and the former because calls to the corresponding"))
Let's assume we need to extract all the dates and their locations from the text column. My approach so far:
#define a regex for date
date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"
#libraries
library(dplyr)
library(stringr)
library(tidyr) #unnest() comes from tidyr
#extract dates
df_i<-df %>%
mutate(i=str_extract_all(text,date)) %>%
mutate(date=lapply(i, function(x) if(identical(x, character(0))) NA_character_ else x)) %>%
unnest(date)
#extract date locations
df_ii<-str_locate_all(df$text,date)
n<-max(sapply(df_ii, nrow))
date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x))))))
The extracted dates are now in a data frame. Is there a way to put the string locations into a data frame as well, alongside the corresponding id and text? Ideally, the output should be:
output<-data.frame(id=c(1,1,2,2,2),
text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
"admission date: 2001-4-19 discharge date: 2002-5-23 service:",
"pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
"pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
"pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 ."),
date=c("2001-4-19","2002-5-23","2105-4-16","2105-4-16","13.1 2105"),
date_start=c(17,43,20,74,96),
date_end=c(25,51,28,82,104))
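One direction that might lead to this shape is sketched below, under some assumptions: the two-row df and its single id column are hypothetical stand-ins for the real data, and rows without a date are kept with NA values. It relies only on str_locate_all() and str_sub() from stringr, building one small data frame per input row and stacking them. It is meant as an illustration, not a definitive solution.

```r
library(stringr)

# Hypothetical two-row stand-in for the real data (single id column assumed)
df <- data.frame(
  id   = c(1, 2),
  text = c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
           "method exists and the former because calls to the corresponding"),
  stringsAsFactors = FALSE
)

# Same regex as in the question
date <- "([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"

# One start/end matrix per row of df (zero rows when nothing matches)
loc <- str_locate_all(df$text, date)

# Build one data frame per input row, then stack them
pieces <- lapply(seq_along(loc), function(k) {
  m <- loc[[k]]
  if (nrow(m) == 0) {
    # Assumption: keep rows without a date and fill them with NA
    return(data.frame(id = df$id[k], text = df$text[k],
                      date = NA_character_,
                      date_start = NA_integer_, date_end = NA_integer_,
                      stringsAsFactors = FALSE))
  }
  data.frame(id = df$id[k], text = df$text[k],
             # cut each match out of the text using its start/end positions
             date = str_sub(df$text[k], m[, "start"], m[, "end"]),
             date_start = m[, "start"], date_end = m[, "end"],
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
```

Because the dates are cut from the text with the same positions that are reported, date, date_start and date_end cannot drift out of sync the way two separate extract and locate passes could.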