Cleaning 'stringr str_replace_all' automatic concatenation when matching multiple times

Question

I used police_officer <- str_extract_all(txtparts, "ID:.*\n") to extract the names of all the police officers involved in a 911 call from a text file. Example:
2237 DISTURBANCE Report taken Call Taker: Telephone Operators Sharon L Moran Location/Address: [BRO 6949] 61 WILSON ST ID: Patrolman Darvin Anderson Disp-22:43:39 Arvd-22:48:57 Clrd-23:49:45 ID: Patrolman Stephen T Pina Disp-22:43:48 Clrd-22:46:10 ID: Sergeant Michael V Damiano Disp-22:46:33 Arvd-22:47:14 Clrd-22:55:22

In some parts, when it matches more than one ID:, I get "c(\" Patrolman Darvin Anderson\\n\", \" Patrolman Stephen T Pina\\n\", \" Sergeant Michael V Damiano\\n\")". Here is what I have tried so far to clean the data:
police_officer <- str_replace_all(police_officer,"c\\(.","")
police_officer <- str_replace_all(police_officer,"\\)","")
police_officer <- str_replace_all(police_officer,"ID:","")
police_officer <- str_replace_all(police_officer,"\\n\","") # I can't get rid of \\n\.

This is what I end up with:
" Patrolman Darvin Anderson\\n\", \" Patrolman Stephen T Pina\\n\", \" Sergeant Michael V Damiano\\n\""

I need help cleaning \\n\.

Wiktor Stribiżew · Accepted Answer · 2016-03-03T22:50:54.440


You can use the following regex with str_match_all:

\bID:\s*(\w+(?:\h+\w+)*)

See the regex demo

> txt <- "Call Taker:    Telephone Operators Sharon L Moran\n  Location/Address:    [BRO 6949] 61 WILSON ST\n                ID:    Patrolman Darvin Anderson\n                       Disp-22:43:39                 Arvd-22:48:57  Clrd-23:49:45\n                ID:    Patrolman Stephen T Pina\n                       Disp-22:43:48                                Clrd-22:46:10\n                ID:    Sergeant Michael V Damiano\n                       Disp-22:46:33                 Arvd-22:47:14  Clrd-22:55:22"
> str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
[[1]]
     [,1]                                [,2]                        
[1,] "ID:    Patrolman Darvin Anderson"  "Patrolman Darvin Anderson" 
[2,] "ID:    Patrolman Stephen T Pina"   "Patrolman Stephen T Pina"  
[3,] "ID:    Sergeant Michael V Damiano" "Sergeant Michael V Damiano"

The regex matches ID: as a whole word, then zero or more whitespace characters (\s*), and then captures a sequence of word characters (\w+: letters, digits, or underscores) optionally separated by horizontal whitespace (\h+). str_match_all returns the captured group alongside the full match, whereas str_extract_all returns only the full match, including the ID: prefix, so str_extract_all is not suitable for this regex.
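As a minimal sketch of how the captured names can be pulled out of the str_match_all result: the function returns a list with one character matrix per input string, and column 2 of that matrix holds the first capture group. The variable names `pattern` and `officers` are illustrative, and `txt` is a shortened version of the sample text.

```r
library(stringr)

# Shortened sample text (same layout as the example in this answer)
txt <- paste0(
  "Call Taker:    Telephone Operators Sharon L Moran\n",
  "  Location/Address:    [BRO 6949] 61 WILSON ST\n",
  "                ID:    Patrolman Darvin Anderson\n",
  "                       Disp-22:43:39  Arvd-22:48:57  Clrd-23:49:45\n",
  "                ID:    Patrolman Stephen T Pina\n",
  "                       Disp-22:43:48  Clrd-22:46:10\n",
  "                ID:    Sergeant Michael V Damiano\n",
  "                       Disp-22:46:33  Arvd-22:47:14  Clrd-22:55:22"
)

pattern <- "\\bID:\\s*(\\w+(?:\\h+\\w+)*)"

# [[1]] selects the match matrix for the first (only) input string;
# column 1 is the full match, column 2 is the first capture group.
officers <- str_match_all(txt, pattern)[[1]][, 2]
print(officers)
```

Because the names come back as a plain character vector with no ID: prefix or trailing newline, no follow-up str_replace_all cleaning should be needed.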

Update:

> time <- str_trim(str_extract(txt, " [[:digit:]]{4}"))
> Call_taker <- str_replace_all(str_extract(txt, "Call Taker:.*\n"),"Call Taker:","" ) %>% str_replace_all("\n","")
> address <- str_extract(txt, "Location/Address:.*\n")
> Police_officer <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
> BPD_log <- cbind(time,Call_taker,address,list(Police_officer[[1]][,2]))
> BPD_log <- as.data.frame(BPD_log)
> colnames(BPD_log) <- c("time", "Call_taker", "address", "Police_officer")
> BPD_log
  time                             Call_taker                                        address
1 6949     Telephone Operators Sharon L Moran Location/Address:    [BRO 6949] 61 WILSON ST\n
                                                                   Police_officer
1 Patrolman Darvin Anderson, Patrolman Stephen T Pina, Sergeant Michael V Damiano
>
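The `c(` text that keeps appearing is what R produces when a multi-element vector stored inside a list is coerced to a single string. As an alternative to the list column used above (not the method from this answer, just a sketch), the names can be joined into one string with paste before building the data frame. The `officers` vector below is typed in by hand to stand for the second column of the str_match_all result, and the other values are copied from the output above.

```r
# Stand-in for Police_officer[[1]][, 2] from the code above
officers <- c("Patrolman Darvin Anderson",
              "Patrolman Stephen T Pina",
              "Sergeant Michael V Damiano")

# Coercing a list element of length > 1 to character deparses it,
# which is where the unwanted c("...") text comes from.
deparsed <- as.character(list(officers))
print(deparsed)

# Joining the names first gives one plain string per call.
officers_joined <- paste(officers, collapse = ", ")

BPD_log <- data.frame(
  time           = "6949",
  Call_taker     = "Telephone Operators Sharon L Moran",
  Police_officer = officers_joined,
  stringsAsFactors = FALSE
)
print(BPD_log)
```

With this approach Police_officer is an ordinary character column, which tends to be easier to write to CSV than a list column; the trade-off is that the individual names have to be split again if they are needed separately.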

edited Mar 03 '16 at 22:50

answered Mar 03 '16 at 15:28


Thanks! I guess the real problem is when I bring everything into a data frame with `Call_taker`, `time`, `address`, and `Police_officer` . `time <- str_trim(str_extract(txt, " [[:digit:]]{4}")) Call_taker <- str_replace_all(str_extract(txt, "Call Taker:.*\n"),"Call Taker:","" ) %>% str_replace_all("\n","") address <- str_extract(txt, "Location/Address:.*\n") Police_officer <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)") BPD_log <- cbind(time,Call_taker,address,Police_officer) BPD_log <- as.data.frame(BPD_log)` we still get `c(` when we bring Police_officer – Jomisilfe Mar 03 '16 at 20:58
I do not know what your final data frame should look like, but note you just added the whole output from `str_match_all` while you only need the `[,2]` dimension. Try `BPD_log <- cbind(time,Call_taker,address,Police_officer[[1]][,2])`. – Wiktor Stribiżew Mar 03 '16 at 21:36
Just saw your update, but I would like the data to be presented under one row, which means all the police officers should be in one cell. if you could do that, that'd be great. – Jomisilfe Mar 03 '16 at 22:31
What about `BPD_log <- cbind(time,Call_taker,address,list(Police_officer[[1]][,2]))`? – Wiktor Stribiżew Mar 03 '16 at 22:49
Note that I set the column names with `colnames(BPD_log) <- c("time", "Call_taker", "address", "Police_officer")`. You can adjust as needed. Answer is updated. – Wiktor Stribiżew Mar 03 '16 at 22:55
Thank you! Now, do you know how to extract everything after the line `Location/Address: `? – Jomisilfe Mar 03 '16 at 23:02

I am not sure what you need. [`(?s)Location\/Address:[^\n]*\R(.*)`](https://regex101.com/r/mZ7pS4/1)? – Wiktor Stribiżew Mar 03 '16 at 23:29
I have posted the full question here: http://stackoverflow.com/questions/35785287/parsing-semi-structured-data. Your last answer gives me an error at `Location\/Address`. It says `> str_match_all(txtparts, " (?s)Location\/Address:[^\n]*\R(.*)") Error: '\/' is an unrecognized escape in character string starting "" (?s)Location\/"` – Jomisilfe Mar 03 '16 at 23:40
That is a regex, not R code. In R, when declaring a regex, you needn't escape the slash, and backslashes must be doubled. – Wiktor Stribiżew Mar 04 '16 at 07:25
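To illustrate the escaping rule from the last comment, here is a sketch of the same regex written as an R string literal: the slash is left unescaped and every backslash is doubled. It assumes the ICU regex engine behind stringr accepts `(?s)` and `\R`, which recent versions should; the shortened `txt` and the names `pattern` and `rest` are illustrative.

```r
library(stringr)

txt <- paste0(
  "Call Taker:    Telephone Operators Sharon L Moran\n",
  "  Location/Address:    [BRO 6949] 61 WILSON ST\n",
  "                ID:    Patrolman Darvin Anderson\n",
  "                       Disp-22:43:39"
)

# Regex as typed into a regex tester: (?s)Location\/Address:[^\n]*\R(.*)
# As an R string: no escape on the slash, each backslash doubled.
pattern <- "(?s)Location/Address:[^\\n]*\\R(.*)"

# Column 2 of the match matrix is the capture group: everything
# after the Location/Address line ((?s) lets . match newlines).
rest <- str_match(txt, pattern)[, 2]
cat(rest)
```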
