Calculate the average based on other columns

Question

I want to calculate

"average of the closing prices for the 5,10,30 consecutive trading days immediately preceding and including the Announcement Day, but excluding trading halt days (days on which trading volume is 0 or NA)

For example, now we set 2014/5/7 is the Announcement day.

then average of price for 5 consecutive days :

average of (price of 2014/5/7,2014/5/5, 2014/5/2, 2014/4/30,2014/4/29),

price of 2014/5/6 and 2014/5/1 was excluded due to 0 trading volume on those days.

EDIT on 11/9/2014

One thing to Note: the announcement day for each stock is different, and it's not last valid date in the data, so usage of tail when calculating average was not appropriate.

Date        Price   Volume
2014/5/9    1.42    668000
2014/5/8    1.4     2972000
2014/5/7    1.5     1180000
2014/5/6    1.59    0
2014/5/5    1.59    752000
2014/5/2    1.6     138000
2014/5/1    1.6     NA
2014/4/30   1.6     656000
2014/4/29   1.61    364000
2014/4/28   1.61    1786000
2014/4/25   1.64    1734000
2014/4/24   1.68    1130000
2014/4/23   1.68    506000
2014/4/22   1.67    354000
2014/4/21   1.7     0
2014/4/18   1.7     0
2014/4/17   1.7     1954000
2014/4/16   1.65    1788000
2014/4/15   1.71    1294000
2014/4/14   1.68    1462000

Reproducible Code:

require(quantmod)
require(data.table)

tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
date_end <- as.Date("2014-09-09")


# retrive data of all stocks
prices <- getSymbols(tickers, from = date_begin, to = date_end, auto.assign = TRUE)

dataset <- merge(Cl(get(prices[1])),Vo(get(prices[1])))


for (i in 2:length(prices)){
  dataset <- merge(dataset, Cl(get(prices[i])),Vo(get(prices[i])))
}

# Write First
write.zoo(dataset, file = "prices.csv", sep = ",", qmethod = "double")

# Read zoo
test <- fread("prices.csv")

setnames(test, "Index", "Date")

Then I got a data.table. The first Column is Date, then the price and volume for each stock.

Actually, the original data contains information for about 40 stocks. Column names have the same patter: "X" + ticker.close , "X" + ticker.volumn

Last trading days for different stock were different.

The desired output :

days    0007.HK 1036.HK
5       1.1     1.1
10      1.1     1.1
30      1.1     1.1

The major issues:

.SD and lapply and .SDCol can be used for looping different stocks. .N can be used when calculating last consecutive N days.
Due to the different announcement day, it becomes a little complicated.

Any suggestions on single stock using quantmod or multiple stocks using data.table are extremely welcomed!

Thanks GSee and pbible for the nice solutions, it was very useful. I'll update my code later incorporating different announcement day for each stocks, and consult you later.

Indeed, it's more a xts question than a data.table one. Anything about data.table will be very helpful. Thanks a lot!

Because the different stocks have different announcement days, I tried to make a solution first following @pbible's logic, any suggestions will be extremely welcomed.

library(quantmod)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")

# Instead of making one specific date_end, different date_end is used for convenience of the following work.

date_end <- c(as.Date("2014-07-08"),as.Date("2014-05-15"))

for ( i in 1: length(date_end)) {

  stocks <- getSymbols(tickers[i], from = date_begin, to = date_end[i], auto.assign = TRUE)
  dataset <- cbind(Cl(get(stocks)),Vo(get(stocks)))
  usable <- subset(dataset,dataset[,2] > 0 & !is.na(dataset[,2]))
  sma.5 <- SMA(usable[,1],5)
  sma.10 <- SMA(usable[,1],10)
  sma.30 <- SMA(usable[,1],30)
  col <- as.matrix(rbind(tail(sma.5,1), tail(sma.10,1), tail(sma.30,1)))
  colnames(col) <- colnames(usable[,1])
  rownames(col) <- c("5","10","30")

  if (i == 1) {
    matrix <- as.matrix(col)
  }
  else  {matrix <- cbind(matrix,col)}
}

I got what I want, but the code is ugly..Any suggestions to make it elegant are extremely welcomed!

I don't see how your "desired output" is possible given the input — GSee, Sep 10 '14 at 11:43
This sounds just like a simple moving average. It looks like the TTR package's [SMA](http://www.inside-r.org/packages/cran/TTR/docs/GD) function could do this, or am I missing something? — pbible, Sep 10 '14 at 12:03
@GSee,hey, It seems that you deleted your answer, which was a very useful one. I'm really sorry for that. Thanks for your time and would you mind to tell me whether what I said gave you a bad feeling? Thanks a lot! — Bigchao, Sep 10 '14 at 15:37
@Bigchao No hard feelings, I just think it didn't answer your question. Your question is not 100% clear to me (as indicated by my first comment above), and you have not illustrated how you have tried to solve the problem yourself. — GSee, Sep 10 '14 at 15:48
@GSee，I'll add my solutions which I thought maight be misleading. And welcome the comments at that time. thanks a lot — Bigchao, Sep 10 '14 at 15:55
I think that multiple stocks in the same table may be a red herring. Different start days and different volume days seems to indicate that you should treat each one separately and then `cbind` the solutions at the end. — pbible, Sep 10 '14 at 15:56
@GSee, Morning, I updated my solution, which is not efficient one. Could you give some comments? Thanks very much! — Bigchao, Sep 11 '14 at 07:01
@Bigchao I'm glad you found a workable solution. You may want to post another question that gets at your new specific problems. Adding more updates to this one will be confusing and won't lead to better answers. If you want to improve a working piece of code consider [Code Review](http://codereview.stackexchange.com/) — pbible, Sep 11 '14 at 12:01
@Bigchao if on a given date one stock has data and the other does not, do you intend to still use that data for the stock that has it? i.e. do you need to find a common index first? — GSee, Sep 11 '14 at 12:47
@GSee,thanks for your comments. it doesn't need to find a common index first. Actually, each data of stocks is totally independent of others. could you give some comments on how to avoid looping using apply or any other approaches? Thanks a lot! — Bigchao, Sep 11 '14 at 14:26

GSee · Accepted Answer · 2014-09-11T15:35:43.293

Well, here's a way to do it. I don't know why you want to get rid of the loop, and this does not get rid of it (in fact it has a loop nested inside another). One thing that you were doing is growing objects in memory with each iteration of your loop (i.e. the matrix <- cbind(matrix,col) part is inefficient). This Answer avoids that.

library(quantmod)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")

myEnv <- new.env()
date_end <- c(as.Date("2014-07-08"),as.Date("2014-05-15"))
lookback <- c(5, 10, 30) # different number of days to look back for calculating mean.

symbols <- getSymbols(tickers, from=date_begin, 
                      to=tail(sort(date_end), 1), env=myEnv) # to=last date
end.dates <- setNames(date_end, symbols)

out <- do.call(cbind, lapply(end.dates, function(x) {
  dat <- na.omit(get(names(x), pos=myEnv))[paste0("/", x)]
  prc <- Cl(dat)[Vo(dat) > 0]
  setNames(vapply(lookback, function(n) mean(tail(prc, n)), numeric(1)), 
           lookback)
}))

colnames(out) <- names(end.dates)
out

#   0007.HK 1036.HK
#5    1.080   8.344
#10   1.125   8.459
#30   1.186   8.805

Some commentary...

I created a new environment, myEnv, to hold your data so that it does not clutter your workspace.
I used the output of getSymbols (as you did in your attempt) because the input tickers are not uppercase.
I named the vector of end dates so that we can loop over that vector and know both the end date and the name of the stock.
the bulk of the code is an lapply loop (wrapped in do.call(cbind, ...)). I'm looping over the named end.dates vector.
1. The first line gets the data from myEnv, removes NAs, and subsets it to only include data up to the relevant end date.
2. The next line extracts the close column and subsets it to only include rows where volume is greater than zero.
3. The vapply loops over a vector of different lookbacks and calculates the mean. That is wrapped in setNames so that each result is named based on which lookback was used to calculate it.
The lapply call returns a list of named vectors. do.call(cbind, LIST) is the same as calling cbind(LIST[[1]], LIST[[2]], LIST[[3]]) except LIST can be a list of any length.
at this point we have a matrix with row names, but no column names. So, I named the columns based on which stock they represent.

Hope this helps.

Thanks so much. It's really helpful and awesome. It's deep night, so I will dig into details tomorrow. Thanks again:) — Bigchao, Sep 11 '14 at 16:01
@GSee I like your solution, but I'm curious about two things. Can you explain how cbind in do.call avoids 'growing objects in memory'? Also that function doesn't see trivial. Is it common R style to embed functions like this? Just curious, R isn't my primary language. thanks — pbible, Sep 11 '14 at 20:11
@pbible `lapply` pre-allocates memory because it knows the size of the list, then you only call `cbind` once. This is different than looping and every time through the loop `cbind`ing to dynamically grow an object (which is costly). For the second question, it is common to pass `lapply` an object and a function. Using an anonymous function may be less common (or less clear). The `lapply` could easily be changed to a `for` loop as long as you pre-allocate a list and then populate it instead of growing it as you go. — GSee, Sep 11 '14 at 20:18

pbible · Answer 2 · 2014-09-10T20:40:14.180

1

How about something like this using the subset and moving average (SMA). Here is the solution I put together.

library(quantmod)

tickers <- c("0007.hk","1036.hk","cvx")
date_begin <- as.Date("2010-01-01")
date_end <- as.Date("2014-09-09")

stocks <- getSymbols(tickers, from = date_begin, to = date_end, auto.assign = TRUE)

stock3Summary <- function(stock){
  dataset <- cbind(Cl(get(stock)),Vo(get(stock)))
  usable <- subset(dataset,dataset[,2] > 0 & !is.na(dataset[,2]))
  sma.5 <- SMA(usable[,1],5)
  sma.10 <- SMA(usable[,1],10)
  sma.30 <- SMA(usable[,1],30)
  col <- as.matrix(rbind(tail(sma.5,1), tail(sma.10,1), tail(sma.30,1)))
  colnames(col) <- colnames(usable[,1])
  rownames(col) <- c("5","10","30")
  col
}

matrix <- as.matrix(stock3Summary(stocks[1]))

for( i in 2:length(stocks)){
  matrix <- cbind(matrix,stock3Summary(stocks[i]))
}

The output:

> matrix
   X0007.HK.Close X1036.HK.Close CVX.Close
5        1.082000       8.476000  126.6900
10       1.100000       8.412000  127.6080
30       1.094333       8.426333  127.6767

This should work with multiple stocks. It will use only the most recent valid date.

edited Sep 10 '14 at 20:40

answered Sep 10 '14 at 13:33

pbible

1,259
1
18
34

Thanks a lot, @pbible, I modified the question a little bit to make it clear. Actually the average will not based only on the most recent valid date. Do you have any ideas about how to proceed. I'll update my code to achieve this later for your comments. – Bigchao Sep 11 '14 at 05:47
I made one solution based on your logic. But it seems ugly now, do you have comments about how to make it elegant? Thanks a lot. Putting getSymbol into a loop is not efficiency. – Bigchao Sep 11 '14 at 06:38
@Bigchao I think you could add a date parameter to the function and instead of the `tail` call `as.matrix(rbind(sma.5[mydate,1]...` in the loop. Then a list of dates that will parallel your list of stocks so like `dates[i]` matching your `stocks[i]`. I haven't had a chance to test that though. – pbible Sep 11 '14 at 11:53

Calculate the average based on other columns

2 Answers2

Linked