Extract last 4-digit number from a series in R using stringr

Question

I would like to flatten lists extracted from HTML tables. A minimal working example is presented below. The example depends on the stringr package in R. The first example exhibits the desired behavior.

library(stringr)

years <- c("2005-", "2003-")
unlist(str_extract_all(years, "[[:digit:]]{4}"))

[1] "2005" "2003"

The example below produces an undesirable result when I try to match the last 4-digit number in a string that contains several numbers.

years1 <- c("2005-", "2003-", "1984-1992, 1996-")
unlist(str_extract_all(years1,"[[:digit:]]{4}$"))

character(0)

As I understand the documentation, appending $ to the pattern should anchor the match to the end of the string. From the second example I would like to extract "2005", "2003", and "1996".

`substr(years1,1,4)` provides a list of "2005" "2003" "1984" where I would like to obtain "2005", "2003", and "1996" — Daniel, Feb 20 '15 at 05:28
@jbaums, that definitely works, could you provide a resource/explanation for your solution? — Daniel, Feb 20 '15 at 05:29

jbaums · Answer 1 · 2015-09-26T22:51:24.390

You can use base R sub for this quite easily:

sub('.*(\\d{4}).*', '\\1', years1)

## [1] "2005" "2003" "1996"

The pattern to be matched here is .* (zero or more of any character) followed by \\d{4} (four consecutive numerals, which we capture by enclosing in parentheses), followed by zero or more characters.

sub replaces the matched pattern with the value in the second argument. In this case, \\1 indicates that we want to replace the whole matched pattern with the first captured substring (i.e. the four consecutive numerals).

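The leading .* is greedy: it first consumes as much of the string as it can, including any earlier runs of \\d{4}, and then backtracks only as far as needed for the rest of the pattern to match. As a result, only the last sequence of four consecutive numerals is captured.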
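
A small sketch of the greedy behaviour described above, using base R only. The extra inputs ("n/a" and "id 123456") are made-up test strings, not from the question; they illustrate two edge cases, on the assumption that real data might contain such values.

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# Greedy .* backtracks only as far as needed, so the LAST four digits are captured
last4 <- sub(".*(\\d{4}).*", "\\1", years1)

# A lazy .*? (here with perl = TRUE) stops as early as possible,
# so the FIRST four digits are captured instead
first4 <- sub(".*?(\\d{4}).*", "\\1", years1, perl = TRUE)

# Edge case 1 (hypothetical input): if nothing matches, sub() returns the input unchanged
no_match <- sub(".*(\\d{4}).*", "\\1", "n/a")

# Edge case 2 (hypothetical input): a run longer than four digits yields its last four digits
long_run <- sub(".*(\\d{4}).*", "\\1", "id 123456")
```

Because sub() hands back unmatched strings unchanged, it may be worth checking with grepl("\\d{4}", x) first if some entries might not contain a year at all.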

edited Sep 26 '15 at 22:51

answered Feb 20 '15 at 05:33

jbaums


This is a very handy solution that I came across when faced with a similar problem. How difficult would it be to change the expression to match the *first* four digits instead of the last ones? – Konrad Sep 26 '15 at 21:32
@Konrad - you could do that with `sub('\\D*(\\d{4}).*', '\\1', years1)`, where `\\D*` means zero or more characters that _aren't_ numerals. – jbaums Sep 26 '15 at 22:51

Rich Scriven · Accepted Answer · 2015-02-20T18:07:27.190

The stringi package has convenient functions that operate on specific parts of a string. So you can find the last occurrence of four consecutive digits with the following.

library(stringi)

x <- c("2005-", "2003-", "1984-1992, 1996-")

stri_extract_last_regex(x, "\\d{4}")
# [1] "2005" "2003" "1996"

Other ways to get the same result are

stri_sub(x, stri_locate_last_regex(x, "\\d{4}"))
# [1] "2005" "2003" "1996"

## or, since these count as words
stri_extract_last_words(x)
# [1] "2005" "2003" "1996"

## or if you prefer a matrix result
stri_match_last_regex(x, "\\d{4}")
#      [,1]  
# [1,] "2005"
# [2,] "2003"
# [3,] "1996"
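
If installing stringi is not an option, the same "last match" idea can be approximated in base R. This is only a sketch, not part of the stringi API: the names all_matches and last_match are made up for illustration.

```r
x <- c("2005-", "2003-", "1984-1992, 1996-")

# Find every run of four digits in each string
# (the result is a list with one character vector per input string)
all_matches <- regmatches(x, gregexpr("[0-9]{4}", x))

# Keep the last match per string, or NA when a string has no match
last_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[length(m)] else NA_character_,
  character(1)
)
```

Returning NA for strings without a match keeps the result the same length as the input, which is convenient when the values go back into a data frame column.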

edited Feb 20 '15 at 18:07

answered Feb 20 '15 at 05:30

Rich Scriven


I often find myself looking at your posts, thinking _I really need to familiarise myself with that package_... :) – jbaums Feb 20 '15 at 05:34
Thank you for the thorough response and exposure to `stringi` – Daniel Feb 20 '15 at 06:07

hwnd · Answer 3 · 2015-02-20T16:00:18.557

The end-of-string anchor $ asserts the position at the end of the string.

So your pattern says: match exactly four digits that sit at the very end of the string. Unfortunately, each of your strings ends with a hyphen. The regex engine matches four digits, then tries to assert the end-of-string position and fails because a hyphen (or more text) follows. It then moves on to the next starting position and tries again, and ultimately finds no match at all.
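
A quick way to see this for yourself, sketched with base R's grepl (the variable names and the hyphen-free test strings are only illustrative):

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# Every string ends in "-", so "four digits, then end of string" never occurs
ends_with_year <- grepl("[[:digit:]]{4}$", years1)

# Strings that really do end in four digits match as expected
ends_with_year_clean <- grepl("[[:digit:]]{4}$", c("2005", "1984-1996"))

# Accounting for the trailing hyphen makes the anchored pattern match again
ends_with_year_hyphen <- grepl("[[:digit:]]{4}-$", years1)
```

The first result is all FALSE, and the other two are all TRUE.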

To fix this, you can greedily consume all characters until the last set of digits.

years1 <- c('2005-', '2003-', '1984-1992, 1996-')
unlist(str_extract_all(years1, perl('.*\\K\\d{4}')))
# [1] "2005" "2003" "1996"
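
The perl() wrapper above comes from older stringr releases. As far as I know it was deprecated in stringr 1.0.0, when the package moved to the ICU regex engine, and I would not assume \\K is available there. A sketch of the same idea in base R, which uses PCRE when perl = TRUE:

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# .* greedily consumes everything up to the last four-digit run;
# \K then discards what has been matched so far, so only the digits are reported
m <- regexpr(".*\\K\\d{4}", years1, perl = TRUE)
last_year <- regmatches(years1, m)
```

Keep in mind that regmatches() with regexpr() drops elements that have no match, so the result can be shorter than the input.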

score 1 · Answer 4 · answered Feb 20 '15 at 05:29

\\d{4}[^\\d]*$

Try this. It should do it for you. See the demo.

https://regex101.com/r/kG5pN6/2
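
Note that, used as an extraction pattern, the regex above also returns the trailing non-digit characters (for example "1996-"). One possible adjustment, offered as a sketch and not necessarily what the linked demo shows, is to move the trailing part into a lookahead so that only the digits are part of the match:

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# As written, the match includes the trailing non-digits
with_tail <- regmatches(years1, regexpr("\\d{4}[^\\d]*$", years1, perl = TRUE))

# With a lookahead, the "only non-digits until the end" condition is checked
# but not included in the match
year_only <- regmatches(years1, regexpr("\\d{4}(?=\\D*$)", years1, perl = TRUE))
```

Here with_tail holds "2005-", "2003-" and "1996-", while year_only holds "2005", "2003" and "1996".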

answered Feb 20 '15 at 05:29

vks

