
Example data:

The dataset has four columns: Time, Var1, Var2, Var3. The Time column grain is 1 minute, but the regression should be performed for each day.

Time <- format(seq(as.POSIXct("2018-02-01 23:12:00"), as.POSIXct("2018-02-25 08:32:00"), by="min"), tz = "EST")
df <- data.frame(Time, Var1=runif(length(Time)), Var2=runif(length(Time)), Var3=runif(length(Time)))

The question:

How do I run a linear regression for each variable for each day? The desired output is the slope for Var1, Var2, and Var3 for each day.

The close solution:

The closest solution I can find is from this post. However, the ROC (rate of change) from the TTR package is not the slope from a linear regression.
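For concreteness, here is a small illustration of why the two differ (computed without TTR so it is self-contained; the toy series is mine):

```r
# ROC is a point-to-point ratio (each value divided by the previous
# one, minus 1), while the regression slope is a single coefficient
# fitted over the whole window -- the two are not interchangeable.
y <- c(1, 2, 4, 8)
roc   <- y[-1] / head(y, -1) - 1          # per-step rate of change: 1 1 1
slope <- coef(lm(y ~ seq_along(y)))[[2]]  # one fitted slope: 2.3
```

On this doubling series the ROC is a constant 100% per step, while the fitted slope is a single number summarizing the whole window.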

Any ideas for this task -- calculate the slope for each variable for each day?

My solution:

df$Time <- as.Date(df$Time) 
df$year <- format(df$Time,format="%Y") 
df$mth <- format(df$Time,format="%m") 
df$day <- format(df$Time,format="%d") 
aggregate( df$Var1 ~ year + mth + day , df , SLOPE_FUNCTION ) 
aggregate( df$Var2 ~ year + mth + day , df , SLOPE_FUNCTION ) 
aggregate( df$Var3 ~ year + mth + day , df , SLOPE_FUNCTION ) 

Can you also show me how to create the SLOPE_FUNCTION based on lm() to yield the slope, and how to apply aggregate() to each column (i.e., Var1, Var2, and Var3) in one line of code?
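For reference, here is a minimal base-R sketch of what such a function could look like; it is my assumption of the intended design, not a tested solution. Since the grain is a fixed 1 minute, the within-day row index can stand in for time, and cbind() on the left-hand side of the formula lets aggregate() handle several columns at once:

```r
# Sketch of a slope helper: regress one day's values against their
# within-day index (valid here because the grain is a fixed 1 minute)
# and return the fitted slope. [[2]] drops the coefficient name.
SLOPE_FUNCTION <- function(y) coef(lm(y ~ seq_along(y)))[[2]]

# One-line application to all three columns would then be:
# aggregate(cbind(Var1, Var2, Var3) ~ year + mth + day, df, SLOPE_FUNCTION)

# Tiny demo on synthetic data with known slopes of 1 and 2:
toy <- data.frame(day  = rep(c("01", "02"), each = 5),
                  Var1 = c(1:5, 2 * (1:5)))
aggregate(Var1 ~ day, toy, SLOPE_FUNCTION)
```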

Kuo-Hsien Chang
  • What is `y` in your linear model? `y ~ Var1 + Var2 + Var3`? – JasonAizkalns Mar 11 '18 at 13:24
  • Time ~ Var1 for the column Var1, Time ~ Var2 for the column Var2, Time ~ Var3 for the column Var3. I just like to see their rate of change in each day. – Kuo-Hsien Chang Mar 11 '18 at 13:29
  • So to be more clear, are you simply looking for `0.3618170 / 0.3409212` for `Var1` as the first row for `Var1` and `0.8796454 / 0.7510891` for `Var2` etc.? – JasonAizkalns Mar 11 '18 at 13:34
  • I am looking for the slope from the linear model for Var1~Var3 in each day (2018-02-01, 2018-02-02, 2018-02-03, etc.). – Kuo-Hsien Chang Mar 11 '18 at 14:36
  • you should post your solution as an answer, not as an edit to the question ... – Ben Bolker Mar 12 '18 at 00:33
  • @BenBolker My solution is not fully working. I have to apply aggregate() to each column and I need to put lm() to the SLOPE_FUNCTION. I will look into it and I will post my solution as an answer once I can do that just using one line code. – Kuo-Hsien Chang Mar 12 '18 at 03:23

2 Answers


If you are simply looking for Time-over-Time changes, you could do:

library(tidyverse)
as_data_frame(df) %>%
  mutate_if(is.numeric, funs(. / lag(.)))

# # A tibble: 33,681 x 4
#    Time                  Var1   Var2    Var3
#    <fct>                <dbl>  <dbl>   <dbl>
#  1 2018-02-01 18:12:00 NA     NA     NA     
#  2 2018-02-01 18:13:00  1.06   1.17   0.433 
#  3 2018-02-01 18:14:00  0.551  0.647  2.41  
#  4 2018-02-01 18:15:00  3.12   1.34   0.134 
#  5 2018-02-01 18:16:00  1.43   0.344  6.43  
#  6 2018-02-01 18:17:00  0.189  0.790  0.823 
#  7 2018-02-01 18:18:00  0.355  3.39   1.51  
#  8 2018-02-01 18:19:00  3.62   0.604  1.17  
#  9 2018-02-01 18:20:00  0.950  0.505  0.0213
# 10 2018-02-01 18:21:00  3.86   2.34  19.5   
# # ... with 33,671 more rows

If you'd like percentage changes, subtract 1 inside the funs() argument:

as_data_frame(df) %>%
  mutate_if(is.numeric, funs(. / lag(.) - 1))


For an lm by day, by variable, I would utilize purrr and broom:

library(tidyverse)
library(lubridate)

as_data_frame(df) %>%
  mutate(Time = ymd_hms(Time)) %>%
  mutate(day = floor_date(Time, unit = "day")) %>%
  gather(variable, value, -day, -Time) %>%
  nest(-day, -variable) %>%
  mutate(model = map(data, ~lm(as.numeric(Time) ~ value, data = .))) %>%
  unnest(model %>% map(broom::tidy))

# # A tibble: 150 x 7
#    day                 variable term           estimate std.error    statistic p.value
#    <dttm>              <chr>    <chr>             <dbl>     <dbl>        <dbl>   <dbl>
#  1 2018-02-01 00:00:00 Var1     (Intercept)  1517518845       618  2457337      0     
#  2 2018-02-01 00:00:00 Var1     value               592      1091        0.543  0.588 
#  3 2018-02-02 00:00:00 Var1     (Intercept)  1517571312      1337  1134724      0     
#  4 2018-02-02 00:00:00 Var1     value              2902      2318        1.25   0.211 
#  5 2018-02-03 00:00:00 Var1     (Intercept)  1517661220      1369  1108633      0     
#  6 2018-02-03 00:00:00 Var1     value             -3981      2333       -1.71   0.0881
#  7 2018-02-04 00:00:00 Var1     (Intercept)  1517744983      1318  1151672      0     
#  8 2018-02-04 00:00:00 Var1     value              1170      2275        0.514  0.607 
#  9 2018-02-05 00:00:00 Var1     (Intercept)  1517833026      1369  1109079      0     
# 10 2018-02-05 00:00:00 Var1     value             -2027      2303       -0.880  0.379
# # ... with 140 more rows

If you want just the slopes, add %>% filter(term == "value") to the pipeline.
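As a quick sanity check of that filtering step, here it is on a single synthetic model with a known relationship (the data and names here are illustrative, not from the question's dataset):

```r
library(dplyr)
library(broom)

# One fitted model: time ~ value, where value = 0.5 * time plus tiny
# noise, so the "value" coefficient should come out near 2.
set.seed(42)
d   <- data.frame(time = 1:60, value = 0.5 * (1:60) + rnorm(60, sd = 0.01))
fit <- lm(time ~ value, data = d)

# tidy() gives one row per coefficient; term == "value" keeps the slope.
tidy(fit) %>% filter(term == "value")
```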


Finally, you may prefer to visualize the data. You can forgo the model building by using geom_smooth() with method = "lm" -- see below. Note: I'm filtering to just a few days because the plot gets busy quickly.

as_data_frame(df) %>%
  mutate(Time = ymd_hms(Time)) %>%
  mutate(day = floor_date(Time, unit = "day")) %>%
  filter(day <= ymd("2018-02-05")) %>%
  gather(variable, value, -day, -Time) %>%
  ggplot(., aes(x = Time, y = value, color = factor(day))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ variable)

R Plot

And alternatively, if you leverage interaction and group, you can plot things a bit differently depending on what you're after when it comes to interpretation:

as_data_frame(df) %>%
  mutate(Time = ymd_hms(Time)) %>%
  mutate(day = floor_date(Time, unit = "day")) %>%
  filter(day <= ymd("2018-02-05")) %>%
  gather(variable, value, -day, -Time) %>%
  ggplot(., aes(x = Time, y = value, color = variable, 
                group = interaction(variable, factor(day)))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) 

Another plot

JasonAizkalns
  • Could you do the slope for each day? For example, one slope value (from lm) in 2018-02-01 for Var1, Var2 and Var3, respectively. and so on. – Kuo-Hsien Chang Mar 11 '18 at 14:35
  • What do you think of my solution below? `df$Time <- as.Date(df$Time); df$year <- format(df$Time,format="%Y"); df$mth <- format(df$Time,format="%m"); df$day <- format(df$Time,format="%d"); aggregate( df[,2:4] ~ Year + mth + dd , df , SLOPE_FUNCTION )` Can you also show me how to create the SLOPE_FUNCTION based on the lm to yield the slope result? Your solution should work, but it is so complicated. – Kuo-Hsien Chang Mar 11 '18 at 17:25
  • What if df contains NAs in columns Var1, Var2, and Var3? I got error messages. – Kuo-Hsien Chang Mar 11 '18 at 22:36
  • Depending on the situation, I would remove the `NA` values via `filter(!is.na(variable))`. – JasonAizkalns Mar 11 '18 at 23:38

You can do this with nlme::lmList once you organize the data properly.

library(tidyverse)
library(lubridate)
df2 <- df %>%
  ## reshape data to get Time repeated for each variable
  gather(var,value,-Time) %>%
  mutate(Time=ymd_hms(Time),   ## convert to date-time variable
         date=date(Time),      ## date info only
         timeval=Time-floor_date(Time,"day"),  ## time since beginning of day
         datevar=interaction(date,var))        ## date/var combo

Now you can fit all the date/var combos at once:

nlme::lmList(value~timeval|datevar,df2)
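The per-group slopes can then be read off with coef(). Here is a self-contained sketch on toy data in the same long shape as df2 (a grouping factor, a numeric time value, and a response); the data and slope values are mine:

```r
library(nlme)

# Toy data: two groups with known slopes 2 and 3 plus tiny noise.
set.seed(1)
toy <- data.frame(
  datevar = rep(c("a", "b"), each = 10),
  timeval = rep(1:10, 2),
  value   = c(2 * (1:10), 3 * (1:10)) + rnorm(20, sd = 0.01)
)

fits <- lmList(value ~ timeval | datevar, data = toy)
coef(fits)  # one row per group; the "timeval" column holds the slopes
```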
Ben Bolker