8

Dear Community,

the data I receive will be in a data frame:

Var_1      Var_2         Date        VaR_3  VaR_4   VaR_5   Var_6
1           4       2010-01-18         7    apple    10    sweet
2           5       2010-07-19         8    orange   11    sour
3           6       2010-01-18         9    kiwi     12    juicy
...        ...      ...               ...   ...     ...    ... 

I would like to use zoo, since it seems to be a flexible object class. I'm just starting with R and I tried to read the description (vignettes) for the package.

Questions:

  1. Given the above data as a data frame, which method is recommended to convert the complete df into a zoo object, telling zoo that it shall use the third column as date column (dates can occur multiple times in the data)?
  2. How do I aggregate all other columns monthly, except columns 4 and 6 using zoo built-in functions? Is zoo able to automatically discard categorical variables and just use those columns that are suited for aggregation?
  3. How do I aggregate all numeric columns monthly, for each category in column 4 (column 6 shall not be included, since it is non-numeric).

Thanks for your support.

John
  • 1,401
  • 2
  • 11
  • 12

1 Answers1

4

zoo objects are time series and are normally numeric vectors or matrices. It seems that what you really have is a bunch of different time series where column 5 identifies which series it is. That is, there is an apple series, an orange series, a kiwi series, etc. and each of them have several columns.

Dropping the last column since its not numeric, using the third column as the index and splitting on column 5 we have:

# create test data
Lines <- "Var_1      Var_2         Date        VaR_3  VaR_4   VaR_5   Var_6
1           4       2010-01-18         7    apple    10    sweet
2           5       2010-07-19         8    orange   11    sour
3           6       2010-01-18         9    kiwi     12    juicy"
cat(Lines, "\n", file = "data.txt")

library(zoo)
z <- read.zoo("data.txt", header = TRUE, index = 3, split = "VaR_5",
  colClasses = c(Var_6 = "NULL"))

The result is:

> z
           Var_1.apple Var_2.apple VaR_3.apple VaR_5.apple Var_1.kiwi
2010-01-18           1           4           7          10          3
2010-07-19          NA          NA          NA          NA         NA
           Var_2.kiwi VaR_3.kiwi VaR_5.kiwi Var_1.orange Var_2.orange
2010-01-18          6          9         12           NA           NA
2010-07-19         NA         NA         NA            2            5
           VaR_3.orange VaR_5.orange
2010-01-18           NA           NA
2010-07-19            8           11

The above assumes that for a given value of column 5 that the dates are unique. If that is not the case then include the aggregate = mean argument or some other value for aggregate.

To now aggregate it into a monthly zoo series we have:

aggregate(z, as.yearmon, mean)

It would also be possible to convert it straight away to monthly by using the FUN = as.yearmon argument:

zm <- read.zoo("data.txt", header = TRUE, index = "Date", split = "VaR_4", 
  FUN = as.yearmon, colClasses = c(Var_6 = "NULL"), aggregate = mean)

See ?read.zoo, vignette("zoo-read"), ?aggregate.zoo and the other vignettes and help files as well.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Dear Gabor, thanks for your reply and the thoughtful suggestions. In reality my dataset is more complex. For the example I reduced the data frame in a way such that it shows the most important properties. Which column for the split will be used is unknown. There are many more columns containing either numbers or categories. Using your method, I have to sort them out manually when I want to perform an aggregation across the zoo object. Is there a way to use a logical test on each column such that it sums up and keeps only those columns of the zoo object, that contain numerical data? – John Oct 24 '11 at 18:02
  • To clarify a bit further: In the first step I would like to aggregate as described over all data. The second step will be to split up by categories. There is a possibility that for each category there occur entries at the same date. This should be ok though, since I would like to determine in most cases just the sum, the mean and the absolute number of counts for each month. – John Oct 24 '11 at 18:04
  • I think you assuming that the object has numeric and non-numeric columns. zoo objects are based on vectors or matrices and can't contain mixtures of different classes. They should all be numeric in this case. Its not clear what you mean by which column to split on is unknown. In the end you will have to know what it is. You can read the data into a data.frame and use read.zoo on the data frame (the first arg of read.zoo can be a data frame). This can be done multiple times with different splits to get different zoo objects if you wish. – G. Grothendieck Oct 24 '11 at 18:10
  • "zoo...can't contain mixtures of different classes": If I use df.z<-read.zoo(df,index.column=3,aggregate=F) on the data frame doesn't it contain all the columns of the example and therefore is a mixture? – John Oct 25 '11 at 08:41
  • Your second comment "In the end you will have to know what it is" is true, but this is after I have done all investigation on the data and I do not know before which one finally will be used. Therefore in the first step I want an aggregation on all data as described. My question still seems to be open: "Is there a way to use a logical test on each column such that it sums up and keeps only those columns of the zoo object (derived from the data frame), that contain numerical data?" – John Oct 25 '11 at 08:52
  • I am not sure what you want but here is some code that might give you some ideas: `DF <- read.table("data.txt", header = TRUE); is.num <- sapply(DF, is.numeric); colMeans(DF[is.num])` . Or perhaps this is what you want where `DF` is as just calculated: `zz <- read.zoo(data.frame(DF[[3]], DF[[is.num]], aggregate = mean)`. There is also `numcolwise` in the plyr package: `numcolwise(sum)(DF)`. – G. Grothendieck Oct 25 '11 at 12:34
  • Thanks for your suggestions. I think zz <- read.zoo(data.frame(df[[3]], df[[is.num]]), aggregate = mean) should do the job. Did I put the closing bracket correctly? I get an error message though, stating the object 'is.num' is not found. Instead of performing a logical test for numerical values, R seems to interpret is a column name? – John Oct 25 '11 at 17:46
  • Yes, the parenthesis you added is correct. The computation of `is.num` is shown in my last response. Be sure you compute it before using it. – G. Grothendieck Oct 25 '11 at 18:33
  • Thanks for clarification. Using the is.num method I was able to get the first and the second method working. The first method has as output a 2x4 matrix with the columns Var_1 Var_2 VaR_3 VaR_5 and one row containing the numbers 2 5 8 11 correspondingly. This is the mean value for each column. – John Oct 26 '11 at 13:53
  • Annotation for the second method: The second method has as output the same columns but 2 rows. The first row starts with 2010-01-18 followed by 2 5 8 11, the second row starts with 2010-07-19 followed by the same numbers 2 5 8 11. My assumption is, that these numbers are the mean values for each date occuring. – John Oct 26 '11 at 13:54