
I am new to purrr and struggling to understand how to append the result of my function onto my dataframe (and get the best performance, since my dataframe is large).

I'm attempting to calculate sunrise time for each row in a dataframe:

library(tidyverse)
library(StreamMetabolism)

test <- structure(list(Latitude = c(44.49845, 42.95268, 42.95268, 44.49845,
44.49845, 44.49845), Longitude = c(-78.19259, -81.36935, -81.36935, -78.19259,
-78.19259, -78.19259), date = c("2014/02/12", "2014/01/24", "2014/01/08",
"2014/01/11", "2014/01/10", "2014/01/07"), timezone = c("EST5EDT", "EST5EDT",
"EST5EDT", "EST5EDT", "EST5EDT", "EST5EDT")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L))

sunRise <- function(Latitude, Longitude, date, timezone){
  print(sunrise.set(Latitude, Longitude, date, timezone, num.days = 1)[1,1])
}

I got this far, which gets me the desired sunrise times:

test %>% 
  pwalk(sunRise)

[1] "2014-02-12 07:17:09 EST"
[1] "2014-01-24 07:47:55 EST"
[1] "2014-01-08 07:56:13 EST"
[1] "2014-01-11 07:47:38 EST"
[1] "2014-01-10 07:47:59 EST"
[1] "2014-01-07 07:48:48 EST"

But I can't seem to figure out how to get the results of my function appended on to the end of the "test" dataframe, say as another variable called "sunrise_time"...

test %>% 
  mutate(sunrisetime = pwalk(sunRise))

Error in mutate_impl(.data, dots) : Evaluation error: argument ".f" is missing, with no default.

Sidebar: if you can recommend a good purrr tutorial that worked for you, please include it in your answer!! There seems to be a lot to know about purrr and I'm not sure what to focus on as a first-timer.

Nova
  • As for a guide to purrr, I would say the functional programming chapters of the WIP version of Advanced R are great: https://adv-r.hadley.nz/fp.html – dave-edison Dec 05 '18 at 22:08
  • Another useful link here: https://paulvanderlaken.com/2018/12/05/learning-functional-programming-purrr/ – AntoniosK Dec 05 '18 at 22:16
  • That paulvanderlaken.com link is fantastic and contains links to many others. Thanks. – Matt L. Dec 05 '18 at 22:28

2 Answers


You don't really need purrr here. Here's a dplyr approach:

library(dplyr)
library(StreamMetabolism)

# updated function
sunRise <- function(Latitude, Longitude, date, timezone){
  sunrise.set(Latitude, Longitude, date, timezone, num.days = 1)[1,1]
}

test %>%
  rowwise() %>%
  mutate(sunrise_time = sunRise(Latitude, Longitude, date, timezone)) %>%
  ungroup()

# # A tibble: 6 x 5
#   Latitude Longitude date       timezone sunrise_time       
#      <dbl>     <dbl> <chr>      <chr>    <dttm>             
# 1     44.5     -78.2 2014/02/12 EST5EDT  2014-02-12 07:17:09
# 2     43.0     -81.4 2014/01/24 EST5EDT  2014-01-24 07:47:55
# 3     43.0     -81.4 2014/01/08 EST5EDT  2014-01-08 07:56:13
# 4     44.5     -78.2 2014/01/11 EST5EDT  2014-01-11 07:47:38
# 5     44.5     -78.2 2014/01/10 EST5EDT  2014-01-10 07:47:59
# 6     44.5     -78.2 2014/01/07 EST5EDT  2014-01-07 07:48:48

Or, if you want to use purrr, you can do:

library(tidyverse)

test %>%
  group_by(id = row_number()) %>%
  nest() %>%
  mutate(sunrise_time = map(data, ~sunRise(.x$Latitude, .x$Longitude, .x$date, .x$timezone))) %>%
  unnest()

# # A tibble: 6 x 6
#      id sunrise_time        Latitude Longitude date       timezone
#   <int> <dttm>                 <dbl>     <dbl> <chr>      <chr>   
# 1     1 2014-02-12 07:17:09     44.5     -78.2 2014/02/12 EST5EDT 
# 2     2 2014-01-24 07:47:55     43.0     -81.4 2014/01/24 EST5EDT 
# 3     3 2014-01-08 07:56:13     43.0     -81.4 2014/01/08 EST5EDT 
# 4     4 2014-01-11 07:47:38     44.5     -78.2 2014/01/11 EST5EDT 
# 5     5 2014-01-10 07:47:59     44.5     -78.2 2014/01/10 EST5EDT 
# 6     6 2014-01-07 07:48:48     44.5     -78.2 2014/01/07 EST5EDT 

You can remove the id column if you want.

Or, you can slightly change your function and do this:

# update function
sunRise <- function(Latitude, Longitude, date, timezone){
  return(list(sunrise_time = sunrise.set(Latitude, Longitude, date, timezone, num.days = 1)[1,1]))
}

# apply function to each row and create a dataframe
# bind columns with original dataset
pmap_df(test, sunRise) %>%
  cbind(test, .)

#   Latitude Longitude       date timezone        sunrise_time
# 1 44.49845 -78.19259 2014/02/12  EST5EDT 2014-02-12 07:17:09
# 2 42.95268 -81.36935 2014/01/24  EST5EDT 2014-01-24 07:47:55
# 3 42.95268 -81.36935 2014/01/08  EST5EDT 2014-01-08 07:56:13
# 4 44.49845 -78.19259 2014/01/11  EST5EDT 2014-01-11 07:47:38
# 5 44.49845 -78.19259 2014/01/10  EST5EDT 2014-01-10 07:47:59
# 6 44.49845 -78.19259 2014/01/07  EST5EDT 2014-01-07 07:48:48
AntoniosK
  • Thanks @AntoniosK. I was using dplyr originally, but it was slow and I hoped `purrr` would speed things up. The dplyr code takes 2.5 hours on my real dataset. With a test dataset that takes about 2 minutes to process, your first solution is faster by 3 seconds than your third solution. The fastest solution I found (but only faster by 5 seconds) was to use your second function and call `pmap_dfr(list(test$Latitude, test$Longitude, test$date, test$timezone), sunRise)` and bind that onto my original dataframe. **Would love to know if there is a super fast way to do this!** – Nova Dec 06 '18 at 14:46
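On the speed question, one thing that might help is to call the function only once per distinct combination of inputs and join the result back. This is only a sketch and it assumes the real dataset contains many repeated Latitude/Longitude/date/timezone rows; if every row is unique it will not save anything. `slow_fn` and `toy` are hypothetical stand-ins (not part of StreamMetabolism) so the example runs on its own; in practice the version of `sunRise` that returns a value would take the place of `slow_fn`.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the slow per-row function.
# It counts how often it is called so the saving is visible.
calls <- 0
slow_fn <- function(Latitude, Longitude, date, timezone) {
  calls <<- calls + 1
  paste(date, timezone)
}

# Toy data with repeated rows (an assumption about the real data)
toy <- tibble(
  Latitude  = c(44.5, 44.5, 43.0, 44.5),
  Longitude = c(-78.2, -78.2, -81.4, -78.2),
  date      = c("2014/01/07", "2014/01/07", "2014/01/08", "2014/01/07"),
  timezone  = "EST5EDT"
)

# Run the function once per distinct combination of inputs
lookup <- toy %>%
  distinct(Latitude, Longitude, date, timezone) %>%
  mutate(sunrise_time = pmap_chr(
    list(Latitude, Longitude, date, timezone),
    slow_fn
  ))

# Join the computed values back onto every original row
result <- toy %>%
  left_join(lookup, by = c("Latitude", "Longitude", "date", "timezone"))
```

Here `slow_fn` runs for the two distinct rows rather than all four. The per-call cost is unchanged, so any gain depends entirely on how much duplication the data has.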

I like the solutions by @AntoniosK, but you are very close. This works, as long as the variables defined for the custom function are all contained in the dataframe:

test %>% 
  mutate(sunrise_time = pmap(., sunRise))
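To make the condition about matching variables concrete: `pmap(., f)` passes every column of the data frame to `f` as a named argument, so an extra column would cause an "unused argument" error. A possible way around that is to pass only the needed columns as a list. The sketch below uses `fake_sunrise` and `toy`, both hypothetical stand-ins so that it runs without StreamMetabolism, and it uses `pmap_chr()` on the assumption that a plain character column is acceptable instead of a list column.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for sunRise(): same argument names,
# returns one character value per call.
fake_sunrise <- function(Latitude, Longitude, date, timezone) {
  paste(date, "07:00:00", timezone)
}

# Toy data with an extra column that the function does not accept
toy <- tibble(
  Latitude  = c(44.49845, 42.95268),
  Longitude = c(-78.19259, -81.36935),
  date      = c("2014/02/12", "2014/01/24"),
  timezone  = c("EST5EDT", "EST5EDT"),
  site      = c("A", "B")
)

# Pass only the columns the function needs, in argument order,
# so the extra `site` column does not break the call.
result <- toy %>%
  mutate(sunrise_time = pmap_chr(
    list(Latitude, Longitude, date, timezone),
    fake_sunrise
  ))
```

The unnamed list is matched to the function's arguments by position, so the order of the columns in `list()` has to follow the function signature.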

One purrr tutorial that's very helpful: Jenny Bryan's purrr tutorial.

Matt L.
  • This doesn't work for me - I get `Error in mutate_impl(.data, dots) : Column 'sunrise_time' is of unsupported class data.frame` and no column added onto `test`. – Nova Dec 06 '18 at 14:01