0

I'm looking to extract patient IDs from an email subject line. I'm working with two data frames: One has output from an SQL database (contains email subject line) and the other has patient information (hospital name and patient ID).

I want to use the the patient ID and scrub the subject line from the first data frame and return the hospital associated to said patient. Unfortunately I am cannot provide access to the data.

## Example Data

Data frame 1 example row:

Column 1 (from_Email): xxxxx@hospital.com 

Column 2 (Time_IN): 1/11/2000 12:00:00

Column 3 (from_Subject): Patient H2445JFLD presented into ER with .... symptoms

Data frame 2 example row:

Column 1 (Hospital Name): Hospital ABC

Column 2 (Patient ID): H2445JFLD 
d.b
  • 32,245
  • 6
  • 36
  • 77
  • "Unfortunately I am cannot provide access to the data." No, but you can provide a couple of rows of **example** data, which realistically reflects the type of data you'll be receiving without actually being part of your data set. For example, if the data was, say, tracking college students' grades (which is also legally protected), you could provide records describing John Q. Taxpayer and Jane Doe's academic records. You can also provide a [mcve] illustrating what you've tried already, and why that's not working. – Nic Feb 21 '17 at 01:13

1 Answers1

1

Since you've shared just one row of data, I'm not certain of the pattern of email subject line from_Subject. If it is an automated emailing system then there is a fixed pattern of email subject line from_Subject. I've provided you 3 ways to extract the Patient_ID from from_Subject.

library(dplyr)

df1 <- data_frame(from_Email = "xxxxx@hospital.com",
                  Time_IN = "1/11/2000 12:00:00",
                  from_Subject = "Patient H2445JFLD presented into ER with .... symptoms")

df2 <- data_frame(Hospital_Name = "Hospital ABC",
                  Patient_ID = "H2445JFLD")

# Extract 2nd word from the subject line
df1 <- df1 %>% mutate(Patient_ID = stringr::word(from_Subject, 2))
# Extract the word after "Patient" from the subject line
df1 <- df1 %>% mutate(Patient_ID = str_extract(df1$from_Subject, '(?<=Patient\\s)\\w+'))
# Extract a word of length 9 that has characters A-Z and 0-9 from the subject line
df1 <- df1 %>% mutate(Patient_ID = str_extract(df1$from_Subject, '\\b[A-Z0-9]{9}\\b'))

Once you've extracted the Patient_ID, then it is a simple left join that you'd need to do.

left_join(df1, df2, on="Patient_ID")
#Joining, by = "Patient_ID"
# A tibble: 1 × 5
#  from_Email            Time_IN         from_Subject                                            Patient_ID Hospital_Name
#  <chr>                 <chr>            <chr>                                                  <chr>       <chr>
#1 xxxxx@hospital.com 1/11/2000 12:00:00 Patient H2445JFLD presented into ER with .... symptoms  H2445JFLD  Hospital ABC
Karthik Arumugham
  • 1,300
  • 1
  • 11
  • 18