4

I have the following R data frame:

zed
# A tibble: 10 x 3
   jersey_number first_name statistics.minutes
   <chr>         <chr>      <chr>             
 1 20            Marques    8:20              
 2 53            Brennan    00:00             
 3 35            Marvin     40:00             
 4 50            Justin     00:00             
 5 14            Jordan     00:00             
 6 1             Trevon     31:00             
 7 15            Alex       2:00              
 8 51            Mike       00:00             
 9 12            Javin      17:00             
10 3             Grayson    38:00     

> dput(zed)
structure(list(jersey_number = c("20", "53", "35", "50", "14", 
"1", "15", "51", "12", "3"), first_name = c("Marques", "Brennan", 
"Marvin", "Justin", "Jordan", "Trevon", "Alex", "Mike", "Javin", 
"Grayson"), statistics.minutes = c("8:20", "00:00", "40:00", 
"00:00", "00:00", "31:00", "2:00", "00:00", "17:00", "38:00")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

This is the format of the data from the API I am receiving it from. All of the columns (there are ~100 cols) are initially of class character. To convert everything, I use readr::type_convert(), but the following error happens:

> zed %>% readr::type_convert()
Parsed with column specification:
cols(
  jersey_number = col_integer(),
  first_name = col_character(),
  statistics.minutes = col_time(format = "")
)
# A tibble: 10 x 3
   jersey_number first_name statistics.minutes
           <int> <chr>      <time>            
 1            20 Marques    08:20             
 2            53 Brennan    00:00             
 3            35 Marvin        NA             
 4            50 Justin     00:00             
 5            14 Jordan     00:00             
 6             1 Trevon        NA             
 7            15 Alex       02:00             
 8            51 Mike       00:00             
 9            12 Javin      17:00             
10             3 Grayson       NA 

Instead of throwing errors and messing up the conversion, I would like it if this minutes column instead turned into class == numeric. If a row shows '8:20' for this column, I'd like this to simply be converted into 8.33.

Any thoughts on how I can do this - preferably something that allows me to continue using type_convert.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Canovice
  • 9,012
  • 22
  • 93
  • 211

2 Answers2

3

library(lubridate)

Read in df with no alterations (your dput code).

Add hours to the minute-seconds:

df$statistics.minutes <- paste0("00:", df$statistics.minutes)

Convert to a time type:

df$statistics.minutes <- lubridate::hms(df$statistics.minutes)

Divide by 60:

period_to_seconds(df$statistics.minutes) / 60

Result:

 [1]  8.333333  0.000000 40.000000  0.000000  0.000000
 [6] 31.000000  2.000000  0.000000 17.000000 38.000000

Replace in df, if desired:

df$statistics.minutes <- period_to_seconds(df$statistics.minutes) / 60

[OP's addition] :-)

I've created the following helper function - based on this result - so I can fix the issue without breaking my pipe chain:

fixMinutes <- function(raw.data) {

  new.raw.data <- raw.data %>%
    dplyr::mutate(statistics.minutes = paste0("00:", statistics.minutes)) %>%
    dplyr::mutate(statistics.minutes = lubridate::hms(statistics.minutes)) %>%
    dplyr::mutate(statistics.minutes = lubridate::period_to_seconds(statistics.minutes) / 60)

  return(new.raw.data)
}

zed %>% 
  ... %>% 
  fixMinutes() %>%
  ... %>%
Marian Minar
  • 1,344
  • 10
  • 25
  • Commenting on my own solution: Knowing a bit more now, a better solution would be something like `zed %>% separate()` then apply `as.numeric`, convert to minutes and merge the two columns. If, in the future, these periods will be required to be added to POSIXct elements, it would be good to use lubridate, otherwise lubridate is not necessary here. – Marian Minar Jun 06 '19 at 14:16
2

The only thing that occurs to me is to convert the offending column to numeric first, something like

(zed 
   ## split stats column in two, with names unlikely to clash w/ existing
   %>% tidyr::separate(statistics.minutes,c("tmp...mins","tmp...secs"))
   ## explicitly convert
   %>% dplyr::mutate(statistics.minutes=as.numeric(tmp...mins)+as.numeric(tmp...secs)/60)
   ## throw out the temp variables
   %>% dplyr::select(-starts_with("tmp..."))
   %>% readr::type_convert()
)

I don't know if that satisfies your "continue using type_convert" criterion. It would be more elegant to pass a custom conversion function to type_convert, but I don't know how to do that.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453