Unable to take mean of columns(A,B,C,D) for any year & sort the frame based on year for plotting in R

Question

Guidelines : 1. File to be used for "df" is copied below:

    A       B      C       D          Date  Year
14.99    9.99   3.99    2.99      1/1/2002  2002
10.99    8.99   3.99    2.99      1/1/2006  2006
14.99    9.99                     1/1/2006  2006
14.99    9.99   3.99    2.99      1/1/1998  1998
14.99   12.99   3.99    2.99    12/25/2012  2012
10.99   10.99   3.99    2.99      4/1/2014  2014
14.99    9.99   3.99    2.99     4/15/2011  2011
14.99   12.99                    9/27/2013  2013
14.99   12.99                     5/2/2014  2014
14.99   12.99   3.99    2.99     6/17/2014  2014
14.99   12.99                     6/7/2013  2013
14.99   12.99   3.99    2.99      3/1/2013  2013
14.99    9.99   3.99    2.99    11/17/2007  2007
14.99    9.99   3.99    2.99      1/1/1987  1987
19.99   17.99   5.99    4.99     6/13/2014  2014
10.99    7.99   3.99    2.99     2/11/2014  2014
14.99   12.99   3.99    2.99      5/9/2014  2014
         9.99           2.99      1/1/2003  2003
14.99    9.99   3.99    2.99      1/1/2003  2003
14.99    9.99   3.99    2.99     11/2/2012  2012
14.99   12.99   3.99    2.99     7/17/2013  2013
14.99   12.99   3.99    2.99      7/1/1980  1980
10.99    8.99   3.99    2.99     9/30/2011  2011
         9.99           2.99      1/1/1996  1996
14.99   12.99                     3/7/2014  2014
14.99    9.99   3.99    2.99     7/29/1966  1966
         9.99                     1/1/1966  1966
14.99   12.99   3.99    2.99      3/5/2013  2013
14.99    9.99   3.99    2.99      1/1/1998  1998
12.99    9.99   3.99    2.99     7/11/2007  2007
14.99    9.99   3.99    2.99      1/1/2004  2004
14.99    9.99   3.99    2.99      1/1/1992  1992
14.99   12.99                    10/4/2013  2013
                6.99    6.99     1/30/2015  2015

Blank Values to be replaced by NA in data.frame "df".
Whenever more than 1 row is present for any year keep only 1 row for that year and update the value in columns "A", "B", "C", "D" with mean for any corresponding year ("NA" should not be counted for taking mean)
Sort the data frame obtained in above step by year

I had used below command for aggregate and sort, but it didn't worked:

aggregate(x=df[,-c(5)], by=list(df$Year), FUN = Mean, na.rm=TRUE)

score 1 · Answer 1 · answered Mar 27 '15 at 08:50

1

You could use the sqldf library which finds the averages of each column by year and sorts them accordingly as follows:

sqldf("select year, avg(A), avg(B), avg(C), avg(D) from df group by year order by year")

The average will ignore NAs.

answered Mar 27 '15 at 08:50

JoeArtisan

135
8

Sum of numbers is done correctly for any year; but while dividing the sum it is considering NA also in count. – Peyush K Mar 27 '15 at 09:56
@PeyushK if I use the following command `sqldf("select year, count(A), count(C) from df group by year order by year")` it doesn't appear to count the NAs. This [answer](http://stackoverflow.com/questions/8859124/na-values-using-sqldf) states that if `df` is a data.frame then it should ignore NAs when calculating the average. If you know anything to the contrary please let me know. – JoeArtisan Mar 27 '15 at 10:45
Example - Let say the rows for year 1966 are :- 14.99 9.99 3.99 2.99 7/29/1966 1966 9.99 1/1/1966 1966 – Peyush K Mar 27 '15 at 10:53
O/P expect 1966 14.99 9.99 3.99 2.99 O/P coming - NA counted 1966 7.49 9.99 1.99 1.49 @ JoeArtisan can you help ? – Peyush K Mar 27 '15 at 11:04
Can you give me a reproducible example? When I used the command I gave in my answer, I get `1966 14.99000 9.99000 3.99 2.99`. – JoeArtisan Mar 27 '15 at 11:33
have you used both the rows for 1966 from example above ? – Peyush K Mar 30 '15 at 11:16
I did use both rows. These are my results (apologies for the dodgy formatting) `sqldf('select count(*) from df where year = 1966') count(*) 2` `sqldf('select count(A), count(B), count(C), count(D) from df where year = 1966') count(A) count(B) count(C) count(D) 1 2 1 1` `sqldf('select sum(A), sum(B), sum(C), sum(D) from df where year = 1966') sum(A) sum(B) sum(C) sum(D) 14.99 19.98 3.99 2.99` `sqldf('select avg(A), avg(B), avg(C), avg(D) from df where year = 1966') avg(A) avg(B) avg(C) avg(D) 14.99 9.99 3.99 2.99` – JoeArtisan Mar 30 '15 at 11:48

score 0 · Answer 2 · answered Mar 27 '15 at 08:36

Only one line solution from data table can give your output,

Try following

df = read.table(text = 'A   B   C   D   Date    Year
14.99   9.99    3.99    2.99    1/1/2002    2002
10.99   8.99    3.99    2.99    1/1/2006    2006
14.99   9.99    NA      NA      1/1/2006    2006
14.99   9.99    3.99    2.99    1/1/1998    1998
14.99   12.99   3.99    2.99    12/25/2012  2012
10.99   10.99   3.99    2.99    4/1/2014    2014
14.99   9.99    3.99    2.99    4/15/2011   2011
14.99   12.99   NA      NA      9/27/2013   2013
14.99   12.99   NA      NA      5/2/2014    2014
14.99   12.99   3.99    2.99    6/17/2014   2014
14.99   12.99   NA      NA      6/7/2013    2013
14.99   12.99   3.99    2.99    3/1/2013    2013
14.99   9.99    3.99    2.99    11/17/2007  2007
14.99   9.99    3.99    2.99    1/1/1987    1987
19.99   17.99   5.99    4.99    6/13/2014   2014
10.99   7.99    3.99    2.99    2/11/2014   2014
14.99   12.99   3.99    2.99    5/9/2014    2014
NA      9.99    NA      2.99    1/1/2003    2003
14.99   9.99    3.99    2.99    1/1/2003    2003
14.99   9.99    3.99    2.99    11/2/2012   2012
14.99   12.99   3.99    2.99    7/17/2013   2013
14.99   12.99   3.99    2.99    7/1/1980    1980
10.99   8.99    3.99    2.99    9/30/2011   2011
NA      9.99    NA      2.99    1/1/1996    1996
14.99   12.99   NA      NA      3/7/2014    2014
14.99   9.99    3.99    2.99    7/29/1966   1966
NA      9.99    NA      NA      1/1/1966    1966
14.99   12.99   3.99    2.99    3/5/2013    2013
14.99   9.99    3.99    2.99    1/1/1998    1998
12.99   9.99    3.99    2.99    7/11/2007   2007
14.99   9.99    3.99    2.99    1/1/2004    2004
14.99   9.99    3.99    2.99    1/1/1992    1992
14.99   12.99   NA      NA      10/4/2013   2013
NA      NA      6.99    6.99    1/30/2015   2015
', header = T)

dt = as.data.table(df)

dt[order(Year), list(A = mean(A, na.rm = TRUE),
          B = mean(B, na.rm = TRUE),
          C = mean(C, na.rm = TRUE),
          D = mean(D, na.rm = TRUE)), by = Year]

   Year        A        B    C    D
 1: 2002 14.99000  9.99000 3.99 2.99
 2: 2006 12.99000  9.49000 3.99 2.99
 3: 1998 14.99000  9.99000 3.99 2.99
 4: 2012 14.99000 11.49000 3.99 2.99
 5: 2014 14.56143 12.70429 4.39 3.39
 6: 2011 12.99000  9.49000 3.99 2.99
 7: 2013 14.99000 12.99000 3.99 2.99
 8: 2007 13.99000  9.99000 3.99 2.99
 9: 1987 14.99000  9.99000 3.99 2.99
10: 2003 14.99000  9.99000 3.99 2.99
11: 1980 14.99000 12.99000 3.99 2.99
12: 1996      NaN  9.99000  NaN 2.99
13: 1966 14.99000  9.99000 3.99 2.99
14: 2004 14.99000  9.99000 3.99 2.99
15: 1992 14.99000  9.99000 3.99 2.99
16: 2015      NaN      NaN 6.99 6.99

I get error on execution - "There were 50 or more warnings (use warnings() to see the first 50)" . on using warnings() i get message **bold**- > warnings() Warning messages: 1: In mean(estHD, na.rm = TRUE) : argument is not numeric or logical: returning NA 2: In mean(estSD, na.rm = TRUE) : argument is not numeric or logical: returning NA 3: In mean(vodHD, na.rm = TRUE) : argument is not numeric or logical: returning NA **bold** @vrajs5 - Moreover the return is not sorted by year column — Peyush K, Mar 27 '15 at 10:42

Unable to take mean of columns(A,B,C,D) for any year & sort the frame based on year for plotting in R

2 Answers2