
I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.

I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:

yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))

When I run this command, "yend_clean" is simply set to "character (empty)".

If I remove the "4$", every date is split into single characters, so a list that originally looked like "1992", "2003" now looks like "1", "9", etc.

So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.

Was hoping someone in here could point me in the right direction.

BrodieG
TheRecruit

2 Answers


This is a regular expression question. Your regular expression is wrong. Use:

unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))

or equivalently

sub("^(\\d{4}).*", "\\1", "2003-")

or, if all you really want is to remove the "-":

sub("-", "", "2003-")

Repetition in regular expressions is controlled by the {} quantifier, so {4} means "repeated four times". You were missing the braces, which made the 4 a literal character. Additionally, $ matches the end of the string, so your expression translates as:

match any single digit, followed by a 4, followed by the end of the string

When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).

The pattern I propose says instead:

match the beginning of the string (^), followed by a digit repeated four times.
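To see the three patterns side by side, here is a minimal sketch. It assumes the stringr package is installed, and the vector `yend` holds made-up sample values standing in for `danger_table$yend`.

```r
library(stringr)

# Made-up sample values standing in for danger_table$yend (assumption)
yend <- c("2001-", "1992-")

# Original pattern: one digit, then a literal "4", then the end of the string.
# Neither value ends in a digit followed by "4", so nothing matches.
broken <- unlist(str_extract_all(yend, "[[:digit:]]4$"))

# Without "4$": every single digit is its own match.
single <- unlist(str_extract_all(yend, "[[:digit:]]"))

# Proposed pattern: start of string, then a digit repeated four times.
fixed <- unlist(str_extract_all(yend, "^[[:digit:]]{4}"))

print(length(broken))
print(single)
print(fixed)
```

The empty result from the first pattern is what appears as an empty character vector in the question.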

The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
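The capture-group technique can be tried on a whole vector with base R alone. This is a small sketch; `yend` again holds made-up sample values, and the "n/a" input is a hypothetical non-matching case added for illustration.

```r
# Made-up sample values (assumption)
yend <- c("2001-", "1992-")

# ^(\\d{4}) captures the four leading digits, .* consumes the rest of the
# string, and "\\1" replaces the whole match with the captured group.
years <- sub("^(\\d{4}).*", "\\1", yend)

# When the pattern does not match, sub() returns the input unchanged.
unchanged <- sub("^(\\d{4}).*", "\\1", "n/a")

print(years)
print(unchanged)
```

One difference worth keeping in mind: sub() always returns a vector of the same length as its input, whereas unlist(str_extract_all(...)) drops elements that have no match.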

BrodieG
  • Thank you for that elaborate answer! In order to keep my syntax as close to the tutorial as possible, I found out that using the braces instead of the $ solved my problem. My code now reads: yend_clean <- unlist(str_extract_all(danger_table$yend, "^[[:digit:]]{4}")) – TheRecruit May 08 '15 at 12:58

If you mean the book Automated Data Collection with R, the code could be like this:

yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]{4}[-]$"))
yend_clean <- unlist(str_extract_all(yend_clean, "^[[:digit:]]{4}"))

This assumes you have a string such as "1993–2007, 2010-" and want the last given year, which is "2010". The first line matches four digits followed by a dash at the end of the string and returns "2010-"; the second line then extracts the four leading digits and returns "2010".
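The two steps can be checked on a small example. This sketch assumes stringr is installed; the first value is the string mentioned above (with the en dash written as \u2013), and the second value is a made-up simple case.

```r
library(stringr)

# First value from the answer; second value is a made-up simple case
yend <- c("1993\u20132007, 2010-", "2001-")

# Step 1: four digits followed by a dash at the very end of the string
step1 <- unlist(str_extract_all(yend, "[[:digit:]]{4}[-]$"))

# Step 2: keep only the four leading digits of each step-1 result
step2 <- unlist(str_extract_all(step1, "^[[:digit:]]{4}"))

print(step1)
print(step2)
```

Anchoring the first pattern with $ is what selects the last year rather than the first one in a multi-year entry.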

Wenting.W
  • This needed tweaking for me in 2020. The source has changed slightly on the Wikipedia page. Here's the code that enabled me to extract the first year (rather than the last year). `> yend_clean <- str_extract_all(danger_table$yend, "^[[:digit:]]{4}") > danger_table$yend <- as.numeric(yend_clean) > danger_table$yend [1] 2001 1992 2013 2013 2013 2013 2016 2016 2016 2003 1986 2014 2005 2013 2003 2013 1993 2012 1984 2015 2017 2016 2017 2000 2019 1997 2018 2012 1997 2002 [31] 2006 1992 2016 2007 1997 1982 2015 2016 2016 2014 2015 2010 1996 2016 1999 2007 2014 2013 2012 2012 2010 2011 1994` – JulianHarty Apr 02 '20 at 12:00