How to get column mean for specific rows only?

Question

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:

period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991

This is the structure of my data:

  country year     score        
 Algeria 1980     -1.1201501 
 Algeria 1981     -1.0526943 
 Algeria 1982     -1.0561565 
 Algeria 1983     -1.1274560 
 Algeria 1984     -1.1353926 
 Algeria 1985     -1.1734330 
 Algeria 1986     -1.1327666 
 Algeria 1987     -1.1263586 
 Algeria 1988     -0.8529455 
 Algeria 1989     -0.2930265 
 Algeria 1990     -0.1564207 
 Algeria 1991     -0.1526328 
 Algeria 1992     -0.9757842 
 Algeria 1993     -0.9714060 
 Algeria 1994     -1.1422258 
 Algeria 1995     -0.3675797 
 ...

The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.

This is how it should look like:

country year     score         mean   
 Algeria 1980     -1.1201501     -1.089
 Algeria 1981     -1.0526943     -1.089
 Algeria 1982     -1.0561565     -1.089
 Algeria 1983     -1.1274560     -1.089
 Algeria 1984     -1.1353926     -0.839
 Algeria 1985     -1.1734330     -0.839
 Algeria 1986     -1.1327666     -0.839
 Algeria 1987     -1.1263586     -0.839
 Algeria 1988     -0.8529455     -0.839
 Algeria 1989     -0.2930265     -0.839
 Algeria 1990     -0.1564207     -0.839
 ...

Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...

Many many thanks for your help!

IRTFM · Accepted Answer · 2012-09-14T00:47:44.507

17

datfrm$mean <-
  with (datfrm, ave( score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN= mean) )

The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:

mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )

edited Sep 14 '12 at 00:47

answered Sep 12 '12 at 18:44

IRTFM

258,963
21
364
487

WOW! This was amazingly quick - and very helpful indeed. Thanks a lot for this super elegant solution! – TiF Sep 12 '12 at 19:01
5

The `findInterval` function needs a better advertising agency. – IRTFM Sep 12 '12 at 19:04

Matt Dowle · Answer 2 · 2012-09-12T20:50:18.257

6

~~Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted~~ [proved wrong, thanks @DWin]. For completeness the data.table equivalent (scales for large data) is :

require(data.table)
DT = as.data.table(DF)   # or just start with a data.table in the first place

DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf))]

or findInterval is likely faster as DWin used :

DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]

edited Sep 12 '12 at 20:50

answered Sep 12 '12 at 19:01

Matt Dowle

58,872
22
166
224

2

That claim about findInterval is incorrect (confirmed after testing), but thanks for the DT offering. – IRTFM Sep 12 '12 at 19:05

score 0 · Answer 3 · answered Oct 31 '19 at 19:58

If the rows are ordered by year, I think the easiest way to accomplish this would be:

m80_83 <- mean(dataframe[1:4,3]) #Finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:10,3])
#etc.

If the rows are not ordered by year, I would use tapply like this.

list.of.means <- c(tapply(dataframe$score, cut(dataframe$year, c(0,1983.5, 1990.5, 3000)), mean)

Here, tapply takes three parameters:

First, the data you want to do stuff with (in this case, datafram$score).

Second, a function that cuts that data up into groups. In this case, it will cut the data into three groups based on the dataframe$year values. Group 1 will include all rows with dataframe$year values from 0 to 1983.5, Group 2 will include all rows with dataframe$year values from 1983.5 to 1990.5, and Group 3 will include all rows with dataframe$year values from 1983.5 to 3000.

Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.

So, list.of.means should be a list of the 3 values you are looking for.

How to get column mean for specific rows only?

3 Answers3

Linked