Use R to read a text file and format extracted data into a table

Question

I have a text file in the following basic format which repeats a few thousand times:

Patient Name- John Smith
Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. 
Donec interdum iaculis lacus. Nunc in placerat augue. 
In ut odio et dui aliquam sagittis at id augue. 
Patient Name- Jane Smith
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011

How can I best get the above text into the following format?

Patient Name    DxCodes    PrCodes    Charges
John Smith      123        678        910
Jane Smith      234        567        1011

I have been able to use str_extract from the stringr package to extract all the Patient Names into one dataframe and DxCodes, PrCodes, and Charges into another dataframe, like this:

Names
John Smith
Jane Smith

And

Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011

But I am unsure how to combine the two dataframes above into the desired format. Should I be using a different approach from the start? I would definitely appreciate any help. Thank you!

Please include the code you've been working with so others can help — Tung, Jun 26 '18 at 16:44
Is "Patient name" always the first string at the beginning of each desired block? Is "Charges" always the first string of the last line of each desired block? — Nicolás Velasquez, Jun 26 '18 at 17:04
That is correct. For each block, the order and the first string for each line is always the same. — user6340762, Jun 26 '18 at 17:14

score 3 · Answer 1 · answered Jun 26 '18 at 17:06

You can use a sequence of regular expressions to pull out each field and then assemble the pieces with data.frame().

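# Anchor each pattern at the start of the line so free text that merely
# mentions a label (e.g. "Charges") is not picked up.
inx1 <- grep("^Patient Name", txt)
inx2 <- grep("^Number of dx codes:", txt)
inx3 <- grep("^Number of pr codes:", txt)
inx4 <- grep("^Charges:", txt)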

PatientName <- sub("^Patient Name[- ]*", "", txt[inx1])
DxCodes <- sub("^.*: *([[:digit:]]*)$", "\\1", txt[inx2])
PrCodes <- sub("^.*: *([[:digit:]]*)$", "\\1", txt[inx3])
Charges <- sub("^.*: *([[:digit:]]*)$", "\\1", txt[inx4])

DxCodes <- as.integer(DxCodes)
PrCodes <- as.integer(PrCodes)
Charges <- as.integer(Charges)

result <- data.frame(PatientName, DxCodes, PrCodes, Charges)
result
#  PatientName DxCodes PrCodes Charges
#1  John Smith     123     678     910
#2  Jane Smith     234     567    1011
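
This approach relies on every block containing all four labelled lines; if one is missing, the vectors no longer line up row by row. A minimal sketch of a sanity check, run on a made-up vector (`txt_bad` is hypothetical, not the asker's data) whose second block lacks a Charges line:

```r
# Made-up input: the second block has no "Charges" line.
txt_bad <- c("Patient Name- A", "Number of dx codes: 1",
             "Number of pr codes: 2", "Charges: 3",
             "Patient Name- B", "Number of dx codes: 4",
             "Number of pr codes: 5")

# Count how often each label occurs at the start of a line.
counts <- c(
  name    = length(grep("^Patient Name", txt_bad)),
  dx      = length(grep("^Number of dx codes:", txt_bad)),
  pr      = length(grep("^Number of pr codes:", txt_bad)),
  charges = length(grep("^Charges:", txt_bad))
)

# TRUE only when every label occurs the same number of times.
all_equal <- length(unique(counts)) == 1
all_equal
# [1] FALSE
```

Running the same check on the real input before calling data.frame() would flag incomplete blocks early.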

Data.

conn <- textConnection("
Patient Name- John Smith
Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. 
Donec interdum iaculis lacus. Nunc in placerat augue. 
In ut odio et dui aliquam sagittis at id augue. 
Patient Name- Jane Smith
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011
")

txt <- readLines(conn)
close(conn)
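
With a real file, `txt` would come from readLines() on the file path rather than from a text connection. A minimal sketch, assuming a hypothetical file written to a temporary path so the example is self-contained (the contents are a shortened stand-in for the asker's data):

```r
# Hypothetical input file, written to a temporary path for illustration.
# With real data, point readLines() at the actual file instead.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Patient Name- John Smith",
  "Number of dx codes: 123",
  "Number of pr codes: 678",
  "Charges: 910",
  "Lorem ipsum dolor sit amet.",
  "Patient Name- Jane Smith",
  "Number of dx codes: 234",
  "Number of pr codes: 567",
  "Charges: 1011"
), path)

# One character element per line, ready for the grep()/sub() steps above.
txt <- readLines(path)
length(txt)
# [1] 9
```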

I like the simpler aspect of your answer that doesn't require splitting the vector by patient. +1 — r2evans, Jun 26 '18 at 17:08

score 1 · Answer 2 · answered Jun 26 '18 at 17:07

Here's an implementation that assumes a fixed order of lines within each patient's block of text.

Data:

txt <- c(
  'Patient Name- John Smith',
  'Number of dx codes: 123',
  'Number of pr codes: 678',
  'Charges: 910',
  'Lorem ipsum dolor sit amet, consectetur adipiscing elit. ',
  'Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. ',
  'Donec interdum iaculis lacus. Nunc in placerat augue. ',
  'In ut odio et dui aliquam sagittis at id augue. ',
  'Patient Name- Jane Smith',
  'Number of dx codes: 234',
  'Number of pr codes: 567',
  'Charges: 1011')

Split the patients up into individual vectors:

patients <- split(txt, cumsum(grepl("^Patient Name", txt)))
str(patients)
# List of 2
#  $ 1: chr [1:8] "Patient Name- John Smith" "Number of dx codes: 123" "Number of pr codes: 678" "Charges: 910" ...
#  $ 2: chr [1:4] "Patient Name- Jane Smith" "Number of dx codes: 234" "Number of pr codes: 567" "Charges: 1011"
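
The grouping works because grepl() marks each header line with TRUE, and cumsum() turns those marks into a running block number that split() can use. A small sketch on a shortened, made-up vector:

```r
# Shortened, made-up input: two blocks, the first with one free-text line.
x <- c("Patient Name- A", "Charges: 1", "free text",
       "Patient Name- B", "Charges: 2")

is_header <- grepl("^Patient Name", x)  # TRUE on the first line of each block
block_id  <- cumsum(is_header)          # running count of headers seen so far

is_header
# [1]  TRUE FALSE FALSE  TRUE FALSE
block_id
# [1] 1 1 1 2 2
```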

For each patient, parse out the relevant parts. This assumes that the order of lines (name, dx, pr, charge) is static, but it can easily be extended.

patients2 <- lapply(patients, function(pat) {
  # Strip the label rather than splitting on "-", so hyphenated names stay whole.
  nm <- trimws(sub("^Patient Name-", "", pat[1]))
  dx <- as.integer(strsplit(pat[2], ":")[[1]][2])
  pr <- as.integer(strsplit(pat[3], ":")[[1]][2])
  ch <- as.integer(strsplit(pat[4], ":")[[1]][2])
  rest <- paste(pat[-(1:4)], collapse="\n")
  data.frame(name = nm, dx = dx, pr = pr, charges = ch, rest = rest,
             stringsAsFactors = FALSE)
})
str(patients2)
# List of 2
#  $ 1:'data.frame':    1 obs. of  5 variables:
#   ..$ name   : chr "John Smith"
#   ..$ dx     : int 123
#   ..$ pr     : int 678
#   ..$ charges: int 910
#   ..$ rest   : chr "Lorem ipsum dolor sit amet, consectetur adipiscing elit. \nDuis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. \n"| __truncated__
#  $ 2:'data.frame':    1 obs. of  5 variables:
#   ..$ name   : chr "Jane Smith"
#   ..$ dx     : int 234
#   ..$ pr     : int 567
#   ..$ charges: int 1011
#   ..$ rest   : chr ""
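
One way to relax the fixed-order assumption is to look each field up by its label instead of by its position. A sketch, assuming the labels are spelled exactly as in the question; `get_field` is a hypothetical helper, not part of the code above:

```r
# Hypothetical helper: return the text after `label` on the first line of
# `pat` that starts with it, or NA if no such line exists.
get_field <- function(pat, label) {
  hit <- grep(paste0("^", label), pat, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(paste0("^", label, "[-: ]*"), "", hit[1]))
}

# Same fields as the first patient, with the lines deliberately shuffled.
pat <- c("Patient Name- John Smith",
         "Charges: 910",
         "Number of pr codes: 678",
         "Number of dx codes: 123")

one_row <- data.frame(
  name    = get_field(pat, "Patient Name"),
  dx      = as.integer(get_field(pat, "Number of dx codes")),
  pr      = as.integer(get_field(pat, "Number of pr codes")),
  charges = as.integer(get_field(pat, "Charges")),
  stringsAsFactors = FALSE
)
one_row
```

The same helper could replace the positional `pat[1]` to `pat[4]` lookups inside the lapply() call.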

Now combine into a single frame.

patients3 <- do.call(rbind.data.frame, patients2)
str(patients3)
# 'data.frame': 2 obs. of  5 variables:
#  $ name   : chr  "John Smith" "Jane Smith"
#  $ dx     : int  123 234
#  $ pr     : int  678 567
#  $ charges: int  910 1011
#  $ rest   : chr  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. \nDuis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. \n"| __truncated__ ""

score 1 · Answer 3 · answered Jun 26 '18 at 18:38

If your text is indeed one continuous string, as you presented it, the following works using capture groups, assuming every record has dx, pr, and charges lines:

library(stringr)

txt <- "
Patient Name- John Smith
Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. 
Donec interdum iaculis lacus. Nunc in placerat augue. 
In ut odio et dui aliquam sagittis at id augue. 
Patient Name- Jane Smith
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011"

# str_match_all() returns a list with one matrix per input string;
# column 2 of that matrix holds the first capture group.
grab <- function(pattern) as.integer(str_match_all(txt, pattern)[[1]][, 2])

df_b <- data.frame(dx      = grab("dx codes: *([[:digit:]]+)"),
                   pr      = grab("pr codes: *([[:digit:]]+)"),
                   charges = grab("Charges: *([[:digit:]]+)"))
df_b
#    dx  pr charges
# 1 123 678     910
# 2 234 567    1011
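
If the file was read with readLines(), it arrives as one element per line rather than as one string. A sketch of collapsing it first, using a made-up two-line vector (`lines_in` is a hypothetical stand-in for the readLines() result):

```r
# Made-up stand-in for the result of readLines().
lines_in <- c("Patient Name- John Smith", "Charges: 910")

# Join the lines with newlines to get a single string.
one_string <- paste(lines_in, collapse = "\n")
length(one_string)
# [1] 1
```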
