I have a data.frame bbm with variables ticker, variable and value. I want to apply an Augmented Dickey-Fuller test via the adf.test function, grouped by ticker and variable. R should add a new column with the corresponding p-values to the initial data.frame.

I tried

x <- with(bbm, tapply(value, list(ticker, variable), adf.test$p.value))
cbind(bbm, x)

This yields Error in adf.test$p.value : object of type 'closure' is not subsettable.

Then I tried

x <- with(bbm, tapply(value, list(ticker, variable), as.list(adf.test)$p.value))
cbind(bbm, x)

This yields a result, but the new column does not contain what I want. Even when I change p.value in the code to method, it still yields some odd number.

Then I tried using ddply:

bbm<-ddply(bbm, .(ticker, variable), mutate, df=adf.test(value)$p.value)

which yields Error: wrong embedding dimension.

How can I solve this? Any suggestions?

Here's a sample of the df:

            ticker                    variable   value
1  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 29898.0
2  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 31302.0
3  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 29127.0
4  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 24056.0
5  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 22080.0
6  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 22585.0
7  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 22674.0
8  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 21733.0
9  1002Z AV Equity        BS_CUSTOMER_DEPOSITS 22016.0
10 1002Z AV Equity        BS_CUSTOMER_DEPOSITS 21999.0
11 1002Z AV Equity        BS_CUSTOMER_DEPOSITS 22013.0
12 1002Z AV Equity        BS_CUSTOMER_DEPOSITS 21135.0
13 1002Z AV Equity                 BS_TOT_LOAN 28476.0
14 1002Z AV Equity                 BS_TOT_LOAN 29446.0
15 1002Z AV Equity                 BS_TOT_LOAN 29273.0
16 1002Z AV Equity                 BS_TOT_LOAN 27579.0
17 1002Z AV Equity                 BS_TOT_LOAN 20769.0
18 1002Z AV Equity                 BS_TOT_LOAN 21370.0
19 1002Z AV Equity                 BS_TOT_LOAN 22306.0
20 1002Z AV Equity                 BS_TOT_LOAN 21013.0
21 1002Z AV Equity                 BS_TOT_LOAN 21810.0
22 1002Z AV Equity          BS_TIER1_CAP_RATIO     6.5
23 1002Z AV Equity          BS_TIER1_CAP_RATIO     6.2
24 1002Z AV Equity          BS_TIER1_CAP_RATIO     7.9
25 1002Z AV Equity          BS_TIER1_CAP_RATIO     9.2
26 1002Z AV Equity          BS_TIER1_CAP_RATIO     8.5
27 1002Z AV Equity          BS_TIER1_CAP_RATIO     6.6
28 1002Z AV Equity          BS_TIER1_CAP_RATIO     9.6
29 1002Z AV Equity BS_TOT_CAP_TO_RISK_BASE_CAP    11.5
30 1002Z AV Equity BS_TOT_CAP_TO_RISK_BASE_CAP    10.9



 > dput(head(select(bbm, ticker, variable, value), 30))
structure(list(ticker = c("1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", 
"1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity", "1002Z AV Equity"
), variable = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 9L, 9L), .Label = c("PX_LAST", "PE_RATIO", "VOL_MEAN", 
"BS_CUSTOMER_DEPOSITS", "BS_TOT_LOAN", "*", "RN366", "BS_TIER1_CAP_RATIO", 
"BS_TOT_CAP_TO_RISK_BASE_CAP", "RETURN_COM_EQY", "BS_LEV_RATIO_TO_TANG_CAP",
"NPLS_TO_TOTAL_LOANS"), class = "factor"), value = c(29898, 31302, 
29127, 24056, 22080, 22585, 22674, 21733, 22016, 21999, 22013, 
21135, 28476, 29446, 29273, 27579, 20769, 21370, 22306, 21013, 
21810, 6.5, 6.2, 7.9, 9.2, 8.5, 6.6, 9.6, 11.5, 10.9)), .Names = c("ticker", 
"variable", "value"), row.names = c(NA, 30L), class = "data.frame")

Oh, and using the analogous dplyr function yields the same error as ddply.

Christoph
  • Can you add sample data from your dataframe? – Joswin K J Aug 25 '15 at 09:35
  • Just did (hope this is alright). – Christoph Aug 25 '15 at 09:41
  • can you show the `head(dput())` of the dataset instead? – erasmortg Aug 25 '15 at 09:55
  • Tried it - I'm new to this, sorry. – Christoph Aug 25 '15 at 10:05
  • What is the result that you are looking for? The `adf.test` takes a time series object and the statistic will be one result per time series that says whether the series, as a whole will have a unit root. Do you want to show the same statistic per group? – erasmortg Aug 25 '15 at 10:14
  • `wrong embedding dimension` could mean that one of your factors does not have enough data for the test (you might have only one or two observations). Try, for instance, running the adf.test with your subsets like so: `adf.test(bbm$value[1:12]);adf.test(bbm$value[13:21]);adf.test(bbm$value[22:28]);adf.test(bbm$value[29:30])` Only the last one will give an error – erasmortg Aug 25 '15 at 10:42
  • Exactly, I don't really mind whether its even in the same data.frame, but I need something that tells me which group (i.e., combination of `ticker` and `variable`) has a unit root. – Christoph Aug 25 '15 at 10:45

2 Answers

Here is a tidyverse solution:

library(dplyr); library(tseries)  # dplyr verbs and adf.test()

bbm %>% 
    group_by(ticker, variable) %>% 
    summarise(pval = ifelse(n() <= 3, NA, adf.test(value)$p.value))

# A tibble: 4 x 3
# Groups:   ticker [?]
  ticker          variable                       pval
  <chr>           <fct>                         <dbl>
1 1002Z AV Equity BS_CUSTOMER_DEPOSITS         0.01  
2 1002Z AV Equity BS_TOT_LOAN                  0.951 
3 1002Z AV Equity BS_TIER1_CAP_RATIO           0.0118
4 1002Z AV Equity BS_TOT_CAP_TO_RISK_BASE_CAP NA     
Warning message:
In adf.test(value) : p-value smaller than printed p-value

The base R ifelse function checks whether a group has three or fewer points, in which case pval is set to NA; otherwise adf.test is run on the group.
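The question asks for the p-values as a new column on the original data frame rather than as a summary table. A minimal sketch of that variant, assuming dplyr and tseries are installed, uses mutate in place of summarise and a plain if/else (one decision per group) in place of ifelse. The sample rows are copied from the question, and the name bbm_p is only illustrative:

```r
library(dplyr)    # group_by(), mutate(), ungroup()
library(tseries)  # adf.test()

# Sample rows copied from the question: two testable groups and one tiny group
bbm <- data.frame(
  ticker   = "1002Z AV Equity",
  variable = rep(c("BS_CUSTOMER_DEPOSITS", "BS_TOT_LOAN",
                   "BS_TOT_CAP_TO_RISK_BASE_CAP"), times = c(12, 9, 2)),
  value    = c(29898, 31302, 29127, 24056, 22080, 22585, 22674, 21733,
               22016, 21999, 22013, 21135,
               28476, 29446, 29273, 27579, 20769, 21370, 22306, 21013, 21810,
               11.5, 10.9)
)

bbm_p <- bbm %>%
  group_by(ticker, variable) %>%
  # evaluated once per group, so adf.test() is never called on a tiny group
  mutate(pval = if (n() <= 3) NA_real_ else adf.test(value)$p.value) %>%
  ungroup()
```

Every row keeps its original position, and the group's p-value is repeated on each row of that group.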

I had a play with it and @erasmortg appears to be correct: the "wrong embedding dimension" error comes from not having enough data points to actually run the adf.test function.

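adf.test needs at least four data points to run without an error, and with exactly four, as below, the statistic is NaN and the p-value is NA: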

> adf.test(rnorm(1))
Error in embed(y, k) : wrong embedding dimension
> adf.test(rnorm(2))
Error in embed(y, k) : wrong embedding dimension
> adf.test(rnorm(3))
Error in res.sum$coefficients[2, 1] : subscript out of bounds
> adf.test(rnorm(4))

    Augmented Dickey-Fuller Test

data:  rnorm(4)
Dickey-Fuller = NaN, Lag order = 1, p-value = NA
alternative hypothesis: stationary
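Since these errors come from groups with too few observations, it can help to count the observations per group before running the test. A minimal base R sketch, using a few rows from the question's sample (the names sizes and too_small are only illustrative):

```r
# A few rows from the question's sample: one group of 7 and one group of 2
bbm <- data.frame(
  ticker   = "1002Z AV Equity",
  variable = rep(c("BS_TIER1_CAP_RATIO", "BS_TOT_CAP_TO_RISK_BASE_CAP"),
                 times = c(7, 2)),
  value    = c(6.5, 6.2, 7.9, 9.2, 8.5, 6.6, 9.6, 11.5, 10.9)
)

# Count observations for every ticker/variable combination
sizes <- aggregate(value ~ ticker + variable, data = bbm, FUN = length)
names(sizes)[3] <- "n"

# Groups with fewer than 4 points make adf.test() fail with an error
too_small <- sizes[sizes$n < 4, ]
too_small
```

Here too_small lists only the BS_TOT_CAP_TO_RISK_BASE_CAP group, which has two observations.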
Vivek Katial

It seems that the problem might be a group that is too small for the test. One way to deal with this is to write a custom function that catches the error (with tryCatch) and pass this function via an lapply() call, like so:

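library(tseries)  # provides adf.test()

# Return the adf.test result, or NULL if the test throws an error
testx <- function (x) {
  return(tryCatch(adf.test(x), error=function(e) NULL))
}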

g<- lapply(split(bbm, bbm$variable), function(x) testx(x$value))
str(g)
#List of 12
# $ PX_LAST                    : NULL
# $ PE_RATIO                   : NULL
# $ VOL_MEAN                   : NULL
# $ BS_CUSTOMER_DEPOSITS       :List of 6
#  ..$ statistic  : Named num -4.86
#  .. ..- attr(*, "names")= chr "Dickey-Fuller"
#  ..$ parameter  : Named num 2
#  .. ..- attr(*, "names")= chr "Lag order"
#  ..$ alternative: chr "stationary"
#  ..$ p.value    : num 0.01
#  ..$ method     : chr "Augmented Dickey-Fuller Test"
#  ..$ data.name  : chr "x"
#  ..- attr(*, "class")= chr "htest"
# $ BS_TOT_LOAN                :List of 6
#  ..$ statistic  : Named num -0.784
#  .. ..- attr(*, "names")= chr "Dickey-Fuller"
#  ..$ parameter  : Named num 2
#  .. ..- attr(*, "names")= chr "Lag order"
#  ..$ alternative: chr "stationary"
#  ..$ p.value    : num 0.951
#  ..$ method     : chr "Augmented Dickey-Fuller Test"
#  ..$ data.name  : chr "x"
#  ..- attr(*, "class")= chr "htest"
# $ *                          : NULL
# $ RN366                      : NULL
# $ BS_TIER1_CAP_RATIO         :List of 6
#  ..$ statistic  : Named num -4.33
#  .. ..- attr(*, "names")= chr "Dickey-Fuller"
#  ..$ parameter  : Named num 1
#  .. ..- attr(*, "names")= chr "Lag order"
#  ..$ alternative: chr "stationary"
#  ..$ p.value    : num 0.0118
#  ..$ method     : chr "Augmented Dickey-Fuller Test"
#  ..$ data.name  : chr "x"
#  ..- attr(*, "class")= chr "htest"
# $ BS_TOT_CAP_TO_RISK_BASE_CAP: NULL
# $ RETURN_COM_EQY             : NULL
# $ BS_LEV_RATIO_TO_TANG_CAP   : NULL
# $ NPLS_TO_TOTAL_LOANS        : NULL

This creates a list object g of length 12 (one element per level of the variable factor). For the valid adf.test calls, the element holds the test result; for the rest, NULL is returned.

If only the p.value per group is of interest, the previous lapply can be wrapped in a sapply() call to get the following object:

h<- sapply(lapply(split(bbm, bbm$variable), function(x) testx(x$value)), function(x) x$p.value)
str(h)
#List of 12
# $ PX_LAST                    : NULL
# $ PE_RATIO                   : NULL
# $ VOL_MEAN                   : NULL
# $ BS_CUSTOMER_DEPOSITS       : num 0.01
# $ BS_TOT_LOAN                : num 0.951
# $ *                          : NULL
# $ RN366                      : NULL
# $ BS_TIER1_CAP_RATIO         : num 0.0118
# $ BS_TOT_CAP_TO_RISK_BASE_CAP: NULL
# $ RETURN_COM_EQY             : NULL
# $ BS_LEV_RATIO_TO_TANG_CAP   : NULL
# $ NPLS_TO_TOTAL_LOANS        : NULL
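Because some elements are NULL, sapply() cannot simplify the result and h stays a list. A minimal sketch of one alternative, assuming tseries is installed: return NA instead of NULL for failed tests and use vapply(), which yields a named numeric vector. The helper adf_p and the list series are illustrative names; the values are taken from the question's sample.

```r
library(tseries)  # adf.test()

# Same idea as testx, but returning only the p-value (NA if the test errors)
adf_p <- function(x) {
  tryCatch(adf.test(x)$p.value, error = function(e) NA_real_)
}

# Two series from the question's sample plus one that is too short to test
series <- list(
  BS_TOT_LOAN = c(28476, 29446, 29273, 27579, 20769, 21370, 22306, 21013, 21810),
  BS_TIER1_CAP_RATIO = c(6.5, 6.2, 7.9, 9.2, 8.5, 6.6, 9.6),
  BS_TOT_CAP_TO_RISK_BASE_CAP = c(11.5, 10.9)
)

# vapply() enforces exactly one numeric value per group
p <- vapply(series, adf_p, numeric(1))
p
```

A numeric vector like p can be filtered or compared directly, for example p[!is.na(p) & p < 0.05].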

As per the comments, if the grouping needs to be by both ticker and variable, this will yield the desired results:

g<- lapply(split(bbm, list(bbm$variable, bbm$ticker)), function(x) testx(x$value))
#to remove the NULL which are not needed:
g[g != "NULL"]
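To work with the p-values afterwards, one option is to collect them in a small data frame keyed by ticker and variable and merge that back onto the original rows. The following is a minimal base R sketch, assuming tseries is installed; it tests for NULL with is.null() rather than comparing against the string "NULL", and the names groups, ok, pvals and bbm_p are illustrative:

```r
library(tseries)  # adf.test()

testx <- function(x) tryCatch(adf.test(x), error = function(e) NULL)

# A few rows from the question's sample: one testable group, one too small
bbm <- data.frame(
  ticker   = "1002Z AV Equity",
  variable = rep(c("BS_TOT_LOAN", "BS_TOT_CAP_TO_RISK_BASE_CAP"), times = c(9, 2)),
  value    = c(28476, 29446, 29273, 27579, 20769, 21370, 22306, 21013, 21810,
               11.5, 10.9)
)

# drop = TRUE skips ticker/variable combinations that have no rows
groups <- split(bbm, list(bbm$variable, bbm$ticker), drop = TRUE)
g <- lapply(groups, function(d) testx(d$value))

# TRUE for groups where adf.test() ran without an error
ok <- !vapply(g, is.null, logical(1))

# One row per successfully tested group
pvals <- data.frame(
  ticker   = vapply(groups[ok], function(d) as.character(d$ticker[1]), character(1)),
  variable = vapply(groups[ok], function(d) as.character(d$variable[1]), character(1)),
  pval     = vapply(g[ok], function(r) r$p.value, numeric(1)),
  row.names = NULL
)

# Left join: rows of untested groups get NA in pval
bbm_p <- merge(bbm, pvals, by = c("ticker", "variable"), all.x = TRUE)
```

Note that merge() may reorder the rows of bbm_p relative to bbm.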
erasmortg
  • Thanks, but it's still not working. The factors yielding `NULL` are not included anyway, so deleting these observations has no effect. As the `ddply` function with `adf.test` is working when grouping by `variable` alone but not by `ticker` alone I tried it with `g<- lapply(split(bbm, list(bbm$variable, bbm$ticker)), function(x) testx(x$value))`. This yields many `NULL`s. How can I efficiently remove these from the dataset? – Christoph Aug 25 '15 at 11:37
  • Why isn't it working? `g[g != "NULL"]` should get rid of the irrelevant groups. The dataset you included shows only one ticker (AV Equity) – erasmortg Aug 25 '15 at 11:44
  • It's working now with `g<- lapply(split(bbm, list(bbm$variable, bbm$ticker)), function(x) testx(x$value))` - so `str(g[g != "NULL"] )` yields all the p-values. I just don't know how I can work with the p-values now...so how can I get these into the initial data.frame or some other data.frame where I can access them and calculate stuff? – Christoph Aug 25 '15 at 11:53
  • This is getting too extensive for the comments, consider asking another question with the explicit details: what do you want to calculate with them? which data sets do you want to bind (i assume). As for this answer, I'll add your comment as part of the answer, if it helped you solve your problem, consider accepting/upvoting it. – erasmortg Aug 25 '15 at 11:59