
I've been trying to mutate a dataset with a user-defined function that includes calls to str_locate and str_sub. The aim is to locate then extract the first digit within a sequence of 3 digits amongst strings, then add this digit (as a character) to a new column called Hundreds.

For example:

  • Given string '821': the string '8' is added to Hundreds.
  • Given string 'Af823.22', the string '8' is added to Hundreds.

Here is my function:

get_hundred <- function(s) {
  match_pos <- str_locate(s, "[0-9]{3}")
  return(str_sub(s, match_pos[1], match_pos[1]))
}

The first 20 rows of my data look like this:

df1 <- structure(list(call.number = c("372.35044 L4383", "344.049 C235", 
"344.410415 DIM", "346.944043 NEI", "808.0667 B2616", "363.6909945 CAST", 
"ABS 2015.0", "371.38 MACK", "372.1102 PRAW", "A823.3 WRIG/T", 
"havmf test", "[DENTISTRY] CROW", "[DENTISTRY] JAWS", "[DENTISTRY] LOWE", 
"[DENTISTRY] MOLA", "[DENTISTRY] SERI", "[DENTISTRY] SKUL", "[DENTISTRY] TEET", 
"[HEALTH]ANKL", "[HEALTH]FOOT"), num.items = c(1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Filtering the data

In fact, I'm only looking for particular forms of string within a large list of call.numbers. I believe the str_detect call below detects the forms of string I want.

df2 <- df1 %>%
  filter(str_detect(call.number, "^[A-Z]?[A-Z|a-z]?[0-9]{3}.*"))

What am I doing wrong?

Now I do this:

df2 %>%
  mutate(Hundreds = get_hundred(call.number))

However, this puts an 'A' in the Hundreds column for row 9, where I expect to see an '8'. Yet if I call get_hundred directly on "A823.3 WRIG/T" (the equivalent string), the function does return an '8'.

get_hundred("A823.3 WRIG/T")

What is it I'm not understanding here?

Klew Lesse

1 Answer


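str_sub expects vectors of start and end positions, one per string. Here match_pos is a matrix, so match_pos[1] picks out only its first element (the start position for the first row), and that single value is recycled across every row.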

See ?str_locate: str_locate() returns an integer matrix with two columns and one row for each element of string. The first column, start, gives the position at the start of the match, and the second column, end, gives the position of the end.

See ?str_sub: start, end. A pair of integer vectors defining the range of characters to extract (inclusive). Alternatively, instead of a pair of vectors, you can pass a matrix to start. The matrix should have two columns, either labelled start and end, or start and length.

Indexing with match_pos[, 1] extracts the whole start column of the matrix returned by str_locate, so str_sub uses the correct start position for each string.
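To make the difference between the two kinds of indexing concrete, here is a minimal sketch using two of the call numbers from the question. The variable names `s`, `single_start` and `per_row_start` are only illustrative.

```r
library(stringr)

s <- c("372.35044 L4383", "A823.3 WRIG/T")

# str_locate() returns a 2 x 2 integer matrix with columns "start" and "end":
# the first 3-digit run starts at position 1 in s[1] and at position 2 in s[2].
match_pos <- str_locate(s, "[0-9]{3}")

# match_pos[1] is a single element: the start position for the first string only.
single_start <- str_sub(s, match_pos[1], match_pos[1])

# match_pos[, 1] is the whole "start" column: one position per string.
per_row_start <- str_sub(s, match_pos[, 1], match_pos[, 1])

single_start   # "3" "A"  (position 1 reused for every string)
per_row_start  # "3" "8"  (each string uses its own start position)
```

Calling the original function on one string at a time hides the problem, because with a single string the first element of the matrix happens to be the right start position.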

library(dplyr)
library(stringr)

get_hundred_tarjae <- function(s) {
  match_pos <- str_locate(s, "[0-9]{3}")
  return(str_sub(s, match_pos[, 1], match_pos[, 1]))
}


df2 <- df1 %>%
  filter(str_detect(call.number, "^[A-Z]?[A-Z|a-z]?[0-9]{3}.*"))

df2 %>%
  mutate(Hundreds = get_hundred_tarjae(call.number))

# A tibble: 9 × 3
  call.number      num.items Hundreds
  <chr>                <dbl> <chr>
1 372.35044 L4383          1 3
2 344.049 C235             1 3
3 344.410415 DIM           1 3
4 346.944043 NEI           1 3
5 808.0667 B2616           1 8
6 363.6909945 CAST         1 3
7 371.38 MACK              1 3
8 372.1102 PRAW            1 3
9 A823.3 WRIG/T            1 8
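As a possible alternative, and not part of the approach above, the same digit can be pulled out in one step with str_extract and a lookahead, which avoids position matrices altogether. This is only a sketch: the function name `get_hundred_lookahead` is hypothetical, and the pattern assumes that "first digit of the first run of three digits" is exactly what is wanted.

```r
library(stringr)

# Hypothetical helper: match one digit that is immediately followed by
# two more digits, i.e. the first digit of the first 3-digit run.
get_hundred_lookahead <- function(s) {
  str_extract(s, "[0-9](?=[0-9]{2})")
}

res <- get_hundred_lookahead(c("821", "Af823.22", "havmf test"))
res  # "8" "8" NA
```

str_extract is vectorised over its input and returns NA where nothing matches, so it can be used directly inside mutate.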
TarJae