0

Just a little background: I got into programming through statistics, and I don't have much formal programming experience, I just know how to make things work. I'm open to any suggestions to come at this from a differet direction, but I'm currently using multiple sqldf queries to get my desired data. I originally started statistical programming in SAS and one of the things I used on a regular basis was the macro programing ability.

For a simplistic example say that I have my table A as the following:

Name      Sex    A    B       DateAdded
John      M      72   1476    01/14/12
Sue       F      44   3269    02/09/12
Liz       F      90   7130    01/01/12
Steve     M      21   3161    02/29/12

The select statement that I'm currently using is of the form: sqldf("SELECT AVG(A), SUM(B) FROM A WHERE DateAdded >= '2012-01-01' AND DateAdded <= '2012-01-31'")

Now I'd like to run this same query on the enteries where DateAdded is in the Month of February. From my experience with SAS, you would create macro variables for the values of DateAdded. I've considered running this as a (very very slow) for loop, but I'm not sure how to pass an R variable into the sqldf, or whether that's even possible. In my table, I'm using the same query over years worth of data--any way to streamline my code would be much appreciated.

  • You could use the `GROUP BY` function of SQL instead. Which database to you use? – sgibb Jul 06 '12 at 18:25
  • As far as I know, you can't `GROUP BY` a range of dates. I have dates in the form of MMDDYY and I want to group them into months and weeks. – user1507458 Jul 06 '12 at 18:39
  • e.g. MySQL has a MONTH function: http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_month – sgibb Jul 06 '12 at 19:21

1 Answers1

3

Read in the data, convert the DateAdded column to Date class, add a yearmon (year/month) column and then use sqldf or aggregate to aggregate by year/month:

Lines <- "Name      Sex    A    B       DateAdded
John      M      72   1476    01/14/12
Sue       F      44   3269    02/09/12
Liz       F      90   7130    01/01/12
Steve     M      21   3161    02/29/12"

DF <- read.table(text = Lines, header = TRUE)

# convert DateAdded column to Date class
DF$DateAdded <- as.Date(DF$DateAdded, format = "%m/%d/%y")

# add a year/month column using zoo
library(zoo)
DF$yearmon <- as.yearmon(DF$DateAdded)

Now that we have the data and its in the right form the answer is just one line of code. Here are two ways:

# 1. using sqldf
library(sqldf)
sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")

# 2. using aggregate
aggregate(cbind(A, B) ~ yearmon, DF, mean)

The result of the last two lines is:

> sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
   yearmon avg(A) avg(B)
1 Jan 2012   81.0   4303
2 Feb 2012   32.5   3215
> 
> # 2. using aggregate
> aggregate(cbind(A, B) ~ yearmon, DF, mean)
   yearmon    A    B
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215

EDIT:

Regarding your question of doing it by week see the nextfri function in the zoo quick reference vignette.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341