-4

I have a sample dataframe "data" as follows:

X          Y  Month   Year    income
2281205 228120  3   2011    1000
2281212 228121  9   2010    1100
2281213 228121  12  2010    900
2281214 228121  3   2011    9000
2281222 228122  6   2010    1111
2281223 228122  9   2010    3000
2281224 228122  12  2010    1889
2281225 228122  3   2011    778
2281243 228124  12  2010    1111
2281244 228124  3   2011    200
2281282 228128  9   2010    7889
2281283 228128  12  2010    2900
2281284 228128  3   2011    3400
2281302 228130  9   2010    1200
2281303 228130  12  2010    2000
2281304 228130  3   2011    1900
2281352 228135  9   2010    2300
2281353 228135  12  2010    1333
2281354 228135  3   2011    2340

I use the ddply to compute the income for each Y

x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
#Now, I also need to find out the X for each Y depending upon the following conditions: 
a. If Y consists of observations of months 9 (2010), 12 (2010), and 3 (2011), then the x corresponds to months 9(2010) i.e. for Y =228121 x=2281212
b. If Y consists of observations of month 6 (2010), 9 (2010), 12(2010) , and 3 (2011)  then the x corresponds to months 6 (2010) i.e. for Y =228122 x=2281222. 
c. If Y consists of observations of month 12 (2010), 3 (2011) then the x corresponds to months 12 (2010) i.e. for Y =228124 x=2281243. 
d. If Y consists of observations of month 12 (2010), 3 (2011) then the x corresponds to months 12 (2010) i.e. for Y =228124 x=2281243. 
e. If Y consists of only one observation then the x corresponds to month of that observation i.e. for Y =228120 x=2281205.

The point here is if I have more than one observation for each Y, that I am choosing x corresponding to month 6 (2010) if available, but if that is not available I choose months close to 6 (2010) (eg. 9 (2010)). Note that if I have only one observation, I will choose x for that observation.

Please suggest how to incorporate these conditions in ddply.

Community
  • 1
  • 1
Metrics
  • 15,172
  • 7
  • 54
  • 83

1 Answers1

2

This solution assumes that earliest X value is selected as sugessted by @DWin. Month and Year variables are converted to date format and then earliest is selected as condition to select X.

library(zoo) #necessary for date manipulation
x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income),
 X=X[as.yearmon(paste(Month,Year,sep="/"),format="%m/%Y")==min(as.yearmon(paste(Month,Year,sep="/"),format="%m/%Y"))])

       Y freq   tot       X
1 228120    1  1000 2281205
2 228121    3 11000 2281212
3 228122    4  6778 2281222
4 228124    2  1311 2281243
5 228128    3 14189 2281282
6 228130    3  5100 2281302
7 228135    3  5973 2281352

EDIT - solution assuming that there could be values before 2010-06

library(zoo)
#new column containing dates made from Month and Year
data$Time<-as.Date(as.yearmon(paste(data$Month,data$Year,sep="/"),format="%m/%Y"))

#calculated difference between new date column and 2010-06
data$Time.dif<-abs(as.numeric(data$Time-as.Date("2010-06-01")))

#now selects X when Time.dif is smallest (0 in case of 2010-06)
x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income), 
           X=X[Time.dif==min(Time.dif)])
Didzis Elferts
  • 95,661
  • 14
  • 264
  • 201
  • Thanks Dwin and Didzis. Actually, I wanted X that is close to or equal to month 6 (2010). In the sample data above, there were no observations with the month 3 (2010). So, your solution is perfect for that. What if I have observations with month 3 (2010)? – Metrics Jan 03 '13 at 15:37
  • 1
    @user1493368 updated my solution for situation with earlier times than 2010-06 – Didzis Elferts Jan 03 '13 at 17:05
  • Thanks Didzis. I am sorry that I couldn't be clearer in the question. It looks to me (from your code) that when I have only two or three or four observations (for each Y) with no month 6 (2010), it will give X corresponding to month 3 (2010) [Perfect!]. However, when I have two or three or four or five observations (for each Y) with month 6 (2010), it will again give X corresponding to month 3 (2010) which is not what I wanted. I want to have X corresponding to month 3(2010) only when I don’t have the data on month 6 (2010). contd... – Metrics Jan 03 '13 at 18:57
  • So, the point is that I wanted X that is equal to month 6 (2010) or close to it if month 6(2010) is not available. Below is the code, I used: `x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income), ifelse(Time.dif>=0, X=X[Time.dif==0], X=X[Time.dif==min(Time.dif)]))` However, I got the following error: Error: length(rows) == 1 is not TRUE – Metrics Jan 03 '13 at 18:58
  • 1
    @user1493368 added abs() to calculation of data$Time.dif, so also for month 3 (2010) difference will be larger than 0. – Didzis Elferts Jan 03 '13 at 19:18
  • That's awesome!Thanks for the help. However, I got the same error as I got when I use ifelse. `Error: length(rows) == 1 is not TRUE` – Metrics Jan 03 '13 at 21:00