7

I have a vector filled with strings of the following format: <year1><year2><id1><id2>

The first entries of the vector look like this:

199719982001
199719982002
199719982003
199719982003

For the first entry we have: year1 = 1997, year2 = 1998, id1 = 2, id2 = 001.

I want to write a regular expression that pulls out year1, id1, and the digits of id2 that are not zero. So for the first entry the regex should output: 199721.

I have tried doing this with the stringr package, and created the following regex:

"^\\d{4}|\\d{1}(?<=\\d{3}$)"

to pull out year1 and id1. However, when using the lookbehind I get an "invalid regular expression" error. This is a bit puzzling to me: can R not handle lookaheads and lookbehinds?

Greg
Thomas Jensen
  • Look at the help page `regex`. Lookbehind is supported for `perl=TRUE`. So `regexpr("^\\d{4}|\\d{1}(?<=\\d{3}$)", s, perl=TRUE)` does not throw an error, but does not select what you want. – mpiktas Jan 12 '12 at 12:02
  • Thanks for the tip! I knew that the regex would not capture everything, I was just experimenting a bit, and got stumped when I kept getting an "invalid regular expression" message. – Thomas Jensen Jan 12 '12 at 14:46
  • With `strapply` in gsubfn this regular expression works and does not require lookahead or lookbehind: `L <- c("199719982001", "199719982002", "199719982003", "199719982003"); library(gsubfn); strapply(L, "^(....)....(.)0*(.*)", c, simplify = TRUE)` – G. Grothendieck Jan 12 '12 at 15:34

3 Answers

9

Since this is a fixed format, why not use substr? year1 is extracted with substr(s,1,4), id1 with substr(s,9,9), and id2 with as.numeric(substr(s,10,12)). In the last case I used as.numeric to get rid of the leading zeroes.
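
A minimal sketch of the substr approach applied to the whole vector, assuming every string is exactly 12 characters long so that id2 occupies positions 10 to 12. The variable names (s, year1, id1, id2, result) are only illustrative, and paste0 is one possible way to glue the pieces back together:

```r
# Sample data from the question
s <- c("199719982001", "199719982002", "199719982003", "199719982003")

year1 <- substr(s, 1, 4)                # characters 1-4: first year
id1   <- substr(s, 9, 9)                # character 9: first id
id2   <- as.numeric(substr(s, 10, 12))  # characters 10-12; as.numeric drops leading zeros

# Concatenate the pieces element-wise
result <- paste0(year1, id1, id2)
print(result)
```

Note that an id2 of 000 would come out as 0 here, which may or may not be what is wanted.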

mpiktas
9

You will need to use gregexpr from the base package. This works:

> s <- "199719982001"
> gregexpr("^\\d{4}|\\d{1}(?<=\\d{3}$)",s,perl=TRUE)
[[1]]
[1]  1 12
attr(,"match.length")
[1] 4 1
attr(,"useBytes")
[1] TRUE

Note the perl=TRUE setting. For more details look into ?regex.

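Judging from the output, though, your regular expression does not catch id1: the second match is at position 12 (the last digit of the string), whereas id1 sits at position 9.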
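
To see the matched text rather than positions, the result of gregexpr can be passed to regmatches. The sketch below does that for the original pattern and then tries a lookahead variant, `\\d(?=\\d{3}$)`, which is my own assumption about what was intended (a digit followed by exactly three digits and the end of the string). The names m_orig, m_look and year1_id1 are illustrative only:

```r
s <- c("199719982001", "199719982002", "199719982003", "199719982003")

# Original pattern: the lookbehind only lets the very last digit through
m_orig <- regmatches(s, gregexpr("^\\d{4}|\\d{1}(?<=\\d{3}$)", s, perl = TRUE))
print(m_orig[[1]])

# Assumed alternative: a lookahead that selects the digit just before the final three
m_look <- regmatches(s, gregexpr("^\\d{4}|\\d(?=\\d{3}$)", s, perl = TRUE))

# Collapse the matches of each string into one value (year1 followed by id1)
year1_id1 <- vapply(m_look, paste0, character(1), collapse = "")
print(year1_id1)
```

This only recovers year1 and id1; the non-zero part of id2 would still need a separate step.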

mpiktas
1

You can use sub.

sub("^(.{4}).{4}(.{1}).*([1-9]{1,3})$","\\1\\2\\3",s)
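
One caveat worth checking: the pattern above requires the string to end in a digit from 1 to 9, so an id2 with a trailing zero (for example 010, a hypothetical value that does not appear in the question's data) is not matched and sub returns the string unchanged. A sketch of a variant that strips leading zeros instead, borrowing the `0*` idea from G. Grothendieck's comment. The names orig and alt are illustrative, and perl = TRUE is set here only to get predictable greedy matching:

```r
# First three entries are from the question; the last one is hypothetical
s <- c("199719982001", "199719982002", "199719982003", "199719982010")

# Pattern from the answer: fails to match when id2 ends in 0
orig <- sub("^(.{4}).{4}(.{1}).*([1-9]{1,3})$", "\\1\\2\\3", s)

# Variant: skip year2, keep id1, drop leading zeros of id2, keep the rest
alt <- sub("^(.{4}).{4}(.)0*(.*)$", "\\1\\2\\3", s, perl = TRUE)

print(orig)
print(alt)
```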
Wojciech Sobala