0

Please help me out. I have been stuck for a long time. I am an R beginner, and I want to know how to perform operations on data frame elements using multiple indices e.g. firms and time as in the below case elegantly? I want to perform some typical operations e.g year-wise mean on some other variables for each firm, and here is an example: I am want to calculate year-wise mean from daily data of variables A and F for all firms in the dataset. This is my dataset. I have used dplyr and hydroTSM packages. Following is my code:

MeanA<- df %>% select(Firm, Date, A,) %>% group_by(Firm) %>% do(daily2annual(., A, FUN=mean, na.rm = TRUE, out.fmt="%Y", dates = 2))

It returns the following error:

Error in eval(expr, envir, enclos) : argument is missing, with no default

I have also tried the data.table package with adding an additional Year column. Code:

MeanA <- df[ , A, by = "Firm" & "Year"]

Result:

Error in `[.data.frame`(df, , A, by = "Firm" & "Year") : unused argument(by = "Firm" & "Year")

Please suggest a way out (with or without dplyr). The table of means, as well as other year-wise values I would be calculating, is input for panel regression analysis using plm package. Thanks in advance.

Polar Bear
  • 731
  • 1
  • 7
  • 21
  • Have you tried by using the function `daily2annual` inside `mutate/summarise`? – akrun Mar 05 '16 at 18:26
  • I haven't tested it as I don't have the hydroTSM package installed. You could try the `do` option by HongOoi, – akrun Mar 05 '16 at 18:48
  • The error message makes sense. I'm seeing your "Date values in the form "m/d/yyyy HH:MM:SS", (and these are probably factor class at the moment) so conversion to Date needs an `as.Date` coercion since it's non-standard. And the HH:MM:SS values are all zero. Usually you can create an aggregation value for year with a call to format(., format="%y") – IRTFM Mar 06 '16 at 00:05
  • No need to transform the data yourself (demean by using the group means for the within model), just let 'plm' take care of that. The data transformations are also directly available in 'plm' with e.g. 'plm::within', 'plm::between', and 'plm::Between'. However, you need to convert your data to a 'pdata.frame' first (see plm's vignette). – Helix123 Mar 07 '16 at 08:00
  • @Helix123 I need to transform because I want to estimate how F influence A across 5 Firms and over 5 YEARS. So it seems year-wise means of A and F shall be inputs for pdata.frame. – Polar Bear Mar 07 '16 at 13:32

2 Answers2

1

Well this isn't a pretty answer but I'm tired of working at it. If you want the yearly mean by firm, I extracted the year from the date field and then used this in the aggregate function.

df$Date = as.Date(df$Date, "%m/%d/%Y")
df$F = as.numeric(df$F)
df$Year = as.Date(paste(substr(df$Date, 1, 4), "1", "1",sep="-"), "%Y-%m-%d")
newDat = aggregate(cbind(A, E, F) ~ Firm + Year, df, FUN = mean)
Dave-Evans
  • 126
  • 2
  • 3
  • 1
    I'm not sure what you mean, "sort the result based on firm and then years". Do you mean you would like the data in a different order but the same values? You can either use the `order` function or just switch `Firm` and `Year` in the aggregate function: `aggregate(cbind(A, E, F) ~ Year + Firm, df, FUN = mean)` – Dave-Evans Mar 05 '16 at 23:12
  • Thanks @Dave-Evans for helping out, at least it works and give the needed result. As C & D need year-wise sum per firm, suppose now I run: newDat1 <-aggregate(cbind(C, D) ~ Firm + Year, df, FUN = sum) . How I combine newDat & newDat1 together. Or any other suitable method to solve the problem. Thanks again – Polar Bear Mar 06 '16 at 19:27
  • Provided that both `newdat`s have the same columns you can use the row bind function `rbind` to combine them. `dat = cbind(newDat, newDat1)`. If this answer solves you problem please mark it as the accepted answer, thanks. – Dave-Evans Mar 06 '16 at 20:40
  • The newDat has values as year-wise means of A, E, and F for each firm and newDat1 has values as year-wise means of C and D for each firm. When I do dat = cbind(newDat, newDat1) it duplicates both Firm, and Year columns. So instead of 6 dat has 8 columns and deleting columns makes it really confusing because this file is merely a small test data. The main data file is a very large dataset. – Polar Bear Mar 07 '16 at 04:38
  • I'm sorry, I bind `rbind` and not `cbind`. My apologies. That should clear things up. I can't seem to edit my previous comment but try `dat = rbind(newDat, newDat1)` – Dave-Evans Mar 07 '16 at 14:56
  • it give following: Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match . Obviously, I had to match two tables with different variables indexed on firms and years. – Polar Bear Mar 07 '16 at 16:14
  • Of course, of course, again my apologies. Try the `merge` function. `merge(newDat, newDat1, by=c("Firm", "Year"), all=T)`. If that exact code doesn't work, try other options in that same function, that should allow you to combine your datasets. Alternatively you could try `aggregate(cbind(A, B, C, D, E, F) ~ Year + Firm, FUN=function(x) c(avg=mean(x), sum=sum(x)),data=df)` to get both sums and means for all your columns – Dave-Evans Mar 07 '16 at 17:34
  • I need to apply mean and sum and some other customized functions for different variables and not to all variables. – Polar Bear Mar 07 '16 at 18:01
0

Try

MeanA<- df %>%
    select(Firm, Date, A) %>%
    group_by(Firm) %>%
    do(daily2annual(., A, mean, na.rm = TRUE, dates = .$Date))
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • It gives following error: Error in match(class(dates), c("numeric", "factor", "Date")) : object 'Date' not found – Polar Bear Mar 05 '16 at 18:59
  • @Siddiqui Try it now. – Hong Ooi Mar 05 '16 at 19:04
  • Did you read in the date correctly? This looks like your Date is not a Date / Posix object but a character – Buggy Mar 05 '16 at 19:04
  • Error in daily2annual.data.frame(., A, mean, na.rm = TRUE, dates = .$Date) : Invalid argument: 'dates' must be of class 'numeric', 'factor', 'Date' did but it not worked. In addition: Warning message: In if (is.na(match(class(dates), c("numeric", "factor", "Date")))) stop("Invalid argument: 'dates' must be of class 'numeric', 'factor', 'Date'") : the condition has length > 1 and only the first element will be used – Polar Bear Mar 05 '16 at 19:09
  • your dates are posix not date. Try using as.Date(df$Date) before running it. – Buggy Mar 05 '16 at 19:17