2

I want to calculate the trend of each day over several years. For example the trend of May 1st from 2000 to 2010. Here is my test-dataframe:

library(lubridate)
date_list = seq(ymd('2000-01-15'),ymd('2010-09-18'),by='day')
testframe = data.frame(Date = date_list)
testframe$Day = substr(testframe$Date, start = 6, stop = 10)
testframe$V1 = rnorm(3900)
testframe$V2 = rnorm(3900)
testframe$V3 = seq(from = 10, to = 25, length.out = 3900)
testframe$V4 = seq(from = 5, to = 45, length.out = 3900)

V1 to V4 are the values. In testframe$Day I already cut out the day, so that I can use that to group the rows. I know that aggregate is good for grouping in this way, but I am pretty clueless how to combine this with an linear model.

In the end I would like to have a dataframe that has a column that contains each single day (without the year of course) and columns that contain the trend/slope of the values from V1 to V4.

Any ideas?

UPDATE:

To make it more clearly. I want and output that looks like this (Trends are random)

Day       V1 Trend   V2 Trend    V3 Trend   V4 Trend
01-01     +0.3          +0.4      +0.9        +0.5
01-02     +0.5          +0.3      +0.8        +0.4
01-03     -0.1          -0.2      +1.0        -0.3
01-04     +0.7          -0.7      +0.9        +0.9
......
......
12-30    -0.3           -0.4      +0.5        +0.8
12-31    -0.7           -0.3      +0.6        +0.9

p-values, Intercept and all would be also great to have.

I found this example, but its still not in the output that I want to have:

#Add year for lm    
testframe$Year = as.numeric(format(testframe$Date,'%Y'))
library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(testframe, "Day", function(df) 
  lm(Year ~ V4, data = df))

# Apply coef to each model and return a data frame
ldply(models, coef)

# Print the summary of each model
l_ply(models, summary, .print = TRUE)
Mr.Spock
  • 511
  • 2
  • 13

2 Answers2

2

This provides a separate intercept and slope for each day of the year per V column. (yday is the day of the year 0, 1, 2, ... and ydayf is the same but as a factor and yr is the numeric 4 digit year.)

m <- as.matrix(testframe[-(1:2)])
yday <- as.POSIXlt(testframe$Date)$yday
ydayf <- factor(yday)
yr <- as.numeric(format(testframe$Date, "%Y"))

fm2 <- lm(m ~ ydayf + ydayf:yr + 0)
coef(fm2)
dummy.coef(fm2) # expand coefficients
summary(fm2)
broom::tidy(fm2) # data frame

If you want separate slopes but just one intercept then use per V column.

fm3 <- lm(m ~ ydayf:yr)
coef(fm3) 
dummy.coef(fm3) # expands coefficients
summary(fm3)
broom::tidy(fm3)  # data frame

If you want separate intercepts but just one slope per V column then:

fm4 <- lm(m ~ ydayf + yr + 0)
coef(fm4) 
dummy.coef(fm4) # expands coefficients
summary(fm4)
broom::tidy(fm4)  # data frame

The book Modern Applied Statistics with S Plus is a good reference for lm formulas.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
2

From your output it looks like for each Day you want to build a linear model of the form V ~ Year, for each V1, V2, V3, V4.

Here's a dplyr approach:

library(lubridate)
library(dplyr)

set.seed(23)  # for reproducibility

# data (using your code)
date_list = seq(ymd('2000-01-15'),ymd('2010-09-18'),by='day')
testframe = data.frame(Date = date_list)
testframe$Day = substr(testframe$Date, start = 6, stop = 10)
testframe$V1 = rnorm(3900)
testframe$V2 = rnorm(3900)
testframe$V3 = seq(from = 10, to = 25, length.out = 3900)
testframe$V4 = seq(from = 5, to = 45, length.out = 3900)


testframe %>% 
  mutate(Year = year(Date)) %>%  # extract the year
  select(-Date) %>%              # remove the Date column
  group_by(Day) %>%              # for each day
  summarise_at(vars(matches("V")), ~lm(. ~ Year)$coefficients[2])  # build a model and keep the slope

# # A tibble: 366 x 5
#   Day        V1      V2    V3    V4
#   <chr>   <dbl>   <dbl> <dbl> <dbl>
# 1 01-01  0.108   0.0554  1.41  3.75
# 2 01-02 -0.0543 -0.103   1.41  3.75
# 3 01-03 -0.143  -0.0176  1.41  3.75
# 4 01-04  0.146  -0.0232  1.41  3.75
# 5 01-05 -0.154  -0.0533  1.41  3.75
# 6 01-06 -0.268   0.0470  1.41  3.75
# 7 01-07 -0.164   0.0873  1.41  3.75
# 8 01-08  0.0634  0.266   1.41  3.75
# 9 01-09  0.0115 -0.0320  1.41  3.75
# 10 01-10  0.0576 -0.237   1.41  3.75
# # ... with 356 more rows

If you want to update the column names to something like v_trend you can use this instead:

summarise_at(vars(matches("V")), list(trend = ~lm(. ~ Year)$coefficients[2]))

Alternative (getting more info out of each model)

If you want to have more info about each linear model I'd recommend to use some data reshaping and the broom package like this:

library(lubridate)
library(tidyverse)
library(broom)

testframe %>% 
  mutate(Year = year(Date)) %>%  
  select(-Date) %>% 
  gather(v, value, -Day, -Year) %>%
  group_by(Day, v) %>%
  nest() %>%
  mutate(dd = map(data, ~tidy(lm(value ~ Year, data = .)))) %>%
  unnest(dd) %>%
  arrange(Day)

# # A tibble: 2,928 x 7
#     Day   v     term          estimate  std.error  statistic  p.value
#   <chr> <chr> <chr>            <dbl>      <dbl>      <dbl>    <dbl>
# 1 01-01 V1    (Intercept)  -217.     162.           -1.34  2.16e- 1
# 2 01-01 V1    Year            0.108    0.0806        1.34  2.16e- 1
# 3 01-01 V2    (Intercept)  -112.     196.           -0.570 5.84e- 1
# 4 01-01 V2    Year            0.0554   0.0976        0.567 5.86e- 1
# 5 01-01 V3    (Intercept) -2800.       0.260    -10756.    6.25e-30
# 6 01-01 V3    Year            1.41     0.000130  10824.    5.94e-30
# 7 01-01 V4    (Intercept) -7489.       0.694    -10787.    6.11e-30
# 8 01-01 V4    Year            3.75     0.000346  10824.    5.94e-30
# 9 01-02 V1    (Intercept)   109.     238.            0.458 6.59e- 1
# 10 01-02 V1    Year           -0.0543   0.119        -0.458 6.59e- 1
# # ... with 2,918 more rows

Then you can query this dataset and get anything you want. For example, if you save the above output as testframe2 you can get the trend / slope for day 01-01, column V1 like this:

testframe2 %>% filter(Day == "01-01" & v == "V1" & term == "Year") %>% pull(estimate)

and the p value of that slope like this:

testframe2 %>% filter(Day == "01-01" & v == "V1" & term == "Year") %>% pull(p.value)
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • Thanks! That is what I want! Is there a way to change the output in a way that I have summary(mod)[['coefficients']]['(Intercept)','Estimate'], summary(mod)[['coefficients']]['x','Estimate'], summary(mod)[['coefficients']]['x','Pr(>|t|)']) for each column? – Mr.Spock Jun 13 '19 at 13:01
  • And another question: vars(matches("V") doesnt work for my real dataframe since the columnnames doesnt start with V. Is there a way to select the columns by column-numbers? – Mr.Spock Jun 13 '19 at 13:09
  • 1
    I'm not sure I get your first question, but I updated my answer to give you all info from each linear model. For the second question I'd recommend you to use the actual names of the columns to make your process more robust (i.e. you never know if the column order might change in the future). You can use `vars()` and provide the column names. – AntoniosK Jun 13 '19 at 13:21