
I read up on regular expressions and Hadley Wickham's stringr and dplyr packages but can't figure out how to get this to work.

I have library circulation data in a data frame, with the call number as a character variable. I'd like to take the initial capital letters and make that a new variable and the digits between the letters and period into a second new variable.

Call_Num
HV5822.H4 C47 Circulating Collection, 3rd Floor
QE511.4 .G53 1982 Circulating Collection, 3rd Floor
TL515 .M63 Circulating Collection, 3rd Floor
D753 .F4 Circulating Collection, 3rd Floor
DB89.F7 D4 Circulating Collection, 3rd Floor 
zx8754
Concept Delta
  • It's not clear to me what your data look like exactly. Can you post code that generates the kind of data frame you're dealing with? – Claus Wilke Jul 07 '15 at 04:17

4 Answers


With the stringi package, this is one option. Because your target sits at the beginning of each string, stri_extract_first() works well. The pattern [:alpha:]{1,} matches a run of one or more letters, so stri_extract_first() returns the first such run in each string. Likewise, stri_extract_first(x, regex = "\\d{1,}") returns the first run of digits.

x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

library(stringi)

data.frame(alpha = stri_extract_first(x, regex = "[:alpha:]{1,}"), 
           number = stri_extract_first(x, regex = "\\d{1,}"))

#  alpha number
#1    HV   5822
#2    QE    511
#3    TL    515
#4     D    753
#5    DB     89
jazzurro
  • Thanks jazzurro, it's working great! Here's the code I adapted for my specific data frame, called "circ_data": `circ_data_new <- transform(circ_data, Call_Num_Alpha = (stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}")))` and `circ_data_new <- transform(circ_data_new, Call_Num_Number = (stri_extract_first(circ_data$Call_Num, regex = "\\d{1,}")))` – Concept Delta Jul 08 '15 at 02:02
  • There was just one little problem - when it created new variables it made them both factors. Could you suggest how to make the first a character type and the second an integer type? – Concept Delta Jul 08 '15 at 02:06
  • @ConceptDelta Thanks for your comment. You want to use `as.character()` and wrap the code. For example, `alpha = as.character(stri_extract_first(x, regex = "[:alpha:]{1,}"))`. Hope this helps you. – jazzurro Jul 08 '15 at 02:51
  • Hi Jazzurro. I tried that: circ_data <- transform(circ_data, as.character(Call_Num_Alpha = (stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}")))) but ended up with this error: Error in as.character(Call_Num_Alpha = (stri_extract_first(circ_data$Call_Num, : supplied argument name 'Call_Num_Alpha' does not match 'x'. Can you see what I'm doing wrong? – Concept Delta Jul 08 '15 at 03:23
  • @ConceptDelta The name `Call_Num_Alpha =` has ended up inside `as.character()`; it needs to sit outside. I think `Call_Num_Alpha = as.character(stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}"))` will work. Let me know if you need more help. – jazzurro Jul 08 '15 at 08:40
  • Dooh! (smacks head). Thanks so much - you Rock! :-) – Concept Delta Jul 08 '15 at 13:50
  • @ConceptDelta It seems that you found your way now. I am glad to see that. :) – jazzurro Jul 08 '15 at 13:53
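
The type question raised in the comments above can be sketched as follows. This is a minimal example, assuming stringi is installed; the small `circ_data` frame here is a hypothetical stand-in for the asker's data. Plain `$<-` assignment keeps the extracted letters as character, and `as.integer()` converts the digits.

```r
library(stringi)

# Hypothetical stand-in for the asker's circ_data data frame
circ_data <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# First run of letters, kept as a character vector
circ_data$Call_Num_Alpha <-
  stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}")

# First run of digits, converted from character to integer
circ_data$Call_Num_Number <-
  as.integer(stri_extract_first(circ_data$Call_Num, regex = "\\d{1,}"))

str(circ_data)
```

If a column has already become a factor, convert it with `as.integer(as.character(f))`, since `as.integer()` on a factor returns the internal level codes rather than the printed values.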

If you want to use stringr, the solution would probably look something like this:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
                              "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
                              "TL515 .M63 Circulating Collection, 3rd Floor",
                              "D753 .F4 Circulating Collection, 3rd Floor",
                              "DB89.F7 D4 Circulating Collection, 3rd Floor"))

library(stringr)

# Column 1 of the match matrix is the full match; columns 2 and 3 are the capture groups
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])
df2
##                                                  Call_Num letter number
## 1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
## 3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
## 4          D753 .F4 Circulating Collection, 3rd Floor      D    753
## 5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

I don't think that sticking the str_match() call into mutate() of dplyr is worth the effort, so I'd just leave it at that. Or use rawr's solution.
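
For anyone who does want the dplyr form, here is a rough sketch of what it could look like. It assumes dplyr and stringr are both installed; the two-row `df` and the `pattern` variable are illustrative names, not part of the answer above. It calls str_match() once per new column, which is part of why it is arguably not worth the effort.

```r
library(dplyr)
library(stringr)

# Small illustrative data frame (a subset of the question's data)
df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Same regular expression as in the str_match() call above
pattern <- "([A-Z]+)(\\d+)\\s*\\."

df2 <- df %>%
  mutate(letter = str_match(Call_Num, pattern)[, 2],              # first capture group
         number = as.integer(str_match(Call_Num, pattern)[, 3]))  # second group, as integer

df2
```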

Claus Wilke

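What about base R? gsub() reduces each call number to its leading letters and digits, separated by a space, and read.table() then splits that into two columns: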

rl <- read.table(header = TRUE, text = "Call_Num
'HV5822.H4 C47 Circulating Collection, 3rd Floor'
                 'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
                 'TL515 .M63 Circulating Collection, 3rd Floor'
                 'D753 .F4 Circulating Collection, 3rd Floor'
                 'DB89.F7 D4 Circulating Collection, 3rd Floor'",
                 stringsAsFactors = FALSE)
cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))

#                                              Call_Num V1   V2
# 1     HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
# 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE  511
# 3        TL515 .M63 Circulating Collection, 3rd Floor TL  515
# 4          D753 .F4 Circulating Collection, 3rd Floor  D  753
# 5        DB89.F7 D4 Circulating Collection, 3rd Floor DB   89
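
A possible variation on the code above, sketched in base R only: passing `col.names` and `colClasses` to read.table() gives the new columns readable names and fixed types instead of V1 and V2. The names `calls`, `reduced`, `parts`, `class_letters` and `class_number` are made up for this illustration.

```r
calls <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
           "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
           "D753 .F4 Circulating Collection, 3rd Floor")

# Keep only the leading letters and digits, separated by a space
reduced <- sub("^([A-Z]+)([0-9]+).*", "\\1 \\2", calls)

# Parse the two space-separated fields with explicit names and types
parts <- read.table(text = reduced,
                    col.names = c("class_letters", "class_number"),
                    colClasses = c("character", "integer"))

parts
```

Note that sub() returns a string unchanged when the pattern does not match, so a call number that does not start with capital letters followed by digits would need to be handled before the read.table() step.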
rawr

You can use strapply from the gsubfn package:

library(gsubfn)

m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)', 
     ~ c(id = x, num = y), simplify = rbind)

X <- as.data.frame(m, stringsAsFactors = FALSE)

#   id  num
# 1 HV 5822
# 2 QE  511
# 3 TL  515
# 4  D  753
# 5 DB   89
hwnd