Extract string and its location using dplyr/tidyr approach

Question

The input data frame has three id columns and one raw_text. u_id corresponds to user, doc_id corresponds to the document of a particular user and sentence id corresponds to a sentence within a document of a user.

df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),
                 doc_id=c(1,1,1,2,2,1,1,2),
                 sent_id=c(1,2,3,1,2,1,2,1),
                 text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                               "pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                               "method exists and the former because calls to the corresponding",
                        "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                        "pertinent results: 2105-4-16 05:02pm gap-14 
                        2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding"))

Let's assume we need to extract all the dates and its location from raw_text. My approach so far -

#define a regex for date
date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"

#library
library(dplyr)
library(stringr)

#extract dates
df_i<-df %>% 
  mutate(i=str_extract_all(text,date)) %>% 
  mutate(date=lapply(i, function(x) if(identical(x, character(0))) NA_character_ else x)) %>% 
  unnest(date)

#extract date locations
df_ii<-str_locate_all(df$text,date)
n<-max(sapply(df_ii, nrow))
date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x) 
  rbind(x, matrix(, n-nrow(x), ncol(x))))))

The date extractions are in data.frame format. Is there an approach to put the string_locations in a data.frame format corresponding to its id and string? Ideally, the output should be -

output<-data.frame(id=c(1,1,2,2,3),
               text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                      "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                      "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 ."),
               date=c("2001-4-19","2002-5-23","2105-4-16","2105-4-16","13.1 2105"),
               date_start=c(17,43,20,74,96),
               date_end=c(25,51,28,82,104))

acylam · Accepted Answer · 2017-09-13T20:50:34.793

You can do this:

regex = "\\b[0-9]+[-][0-9]+[-][0-9]+\\b"
df_i = str_extract_all(df$text, regex) 
df_ii = str_locate_all(df$text, regex) 

output1 = Map(function(x, y, z){
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, date = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind,.) %>%
  merge(df, .)

or stick with piping-only syntax:

regex = "[0-9]+[-][0-9]+[-][0-9]+"

output1 = df %>%
  {list(.$id, str_extract_all(.$text, regex), 
       str_locate_all(.$text, regex))} %>%
  {Map(function(x, y, z){
    if(length(y) == 0){
      y = NA
    }
    if(nrow(z) == 0){
      z = rbind(z, list(start = NA, end = NA))
    }
    data.frame(id = x, date = y, z)
  }, .[[1]], .[[2]], .[[3]])} %>%
  do.call(rbind, .) %>%
  merge(df, .)

Result:

  id
1  1
2  1
3  2
4  2
5  2
6  3
                                                                                                                 text
1                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
2                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
3 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
4 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
5 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
6                                                     method exists and the former because calls to the corresponding
       date start end
1 2001-4-19    17  25
2 2002-5-23    43  51
3 2105-4-16    20  28
4 2105-4-16    77  85
5 2105-4-16   104 112
6      <NA>    NA  NA

Notes:

Your regular expression incorrectly extracts "13.1" from "rdw-13.1 2105-4-16" because you added spaces in your [- . /]. date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})" should do it.
mutate allows you to use a variable you have just created inside the same function call, so there is no need to use two separate mutate's for df_i.
For my pipping-only solution, {} are needed around list() and Map() to override the dplyr default of feeding in the output from the previously step to the first argument of the next function.

For instance:

df %>%
      list(.$id, str_extract_all(.$text, regex), 
                 str_locate_all(.$text, regex))

becomes:

list(df, df$id, str_extract_all(df$text, regex), 
                str_locate_all(df$text, regex))

which is not what we want.

Edits:

OP updated his df to include rows where text does not include any dates. This would cause my original solution to fail since some elements of the list from str_extract_all and str_locate_all would have length(0) and nrow(0). I solved this issue by adding two if statements:

if(length(y) == 0){
  y = NA
}
if(nrow(z) == 0){
  z = rbind(z, list(start = NA, end = NA))
}

This makes dates = "NA and adds a row of NA's to start and end for those rows with no dates. This allows id to have one row to bind to in the data.frame step.

Thank you for your response. Appreciate it. I tried to implement the first solution. But it gives me error stating - "Error in data frame: argument imply different number of rows:1,0" — x1carbon, Sep 13 '17 at 20:06
@x1carbon That could be because I used the same name `df_i` and `df_ii` as your dates and location variables, and they are different objects. Try clearing your workspace and only run my code. — acylam, Sep 13 '17 at 20:12
I checked for this issue. So, there might be cases where a text will not have any dates which will lead to string and location extract to character(0) and integer(0). My bad, let me correct the question. — x1carbon, Sep 13 '17 at 20:28
@x1carbon ok. A follow up question would then be, for those rows with no dates, do you want to keep them and add NA's to `date`, `start`, and `end`? or simply remove them? — acylam, Sep 13 '17 at 20:33
We would like to keep the NA`s. The reason being, input data frame has u_id - user id, doc_id - document id for a user, and sent_id - sentence within a document. In the output, we would like to keep it to a similar structure so that we could map the date back to its sentence, to its document for a particular user. This was the reason I wanted to keep it dplyr structure. — x1carbon, Sep 13 '17 at 20:49
@x1carbon What do you mean by multiple id's? do you mean composite id's made up of multiple columns? This discussion is getting lengthy. You might want to accept this answer, and ask a new question about the multiple ids issue and link that new question to this one. In your new question, you don't need to put your original code, since that is dealt with in this question. — acylam, Sep 13 '17 at 21:11
@x1carbon If my answer is useful, don't forget to accept it :) — acylam, Sep 14 '17 at 14:13

Extract string and its location using dplyr/tidyr approach

1 Answers1