-2

For each row in an input data frame I want to create a new data frame consisting of all the rows in my data with the same RETAIL_WEEK value and DAY_OF_WEEK value. For instance, if I had matchday = matchweek = 3, the following can be used to find the desired data frame:

library(sqldf); library(gsubfn) # second one may not be needed.
fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek")
     CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
1 1/20/2009           3        2009    2334547   FALSE           3
2 1/19/2010           3        2010    9854269   FALSE           3
3 1/18/2011           3        2011    1951332   FALSE           3
4 1/17/2012           3        2012    8419327    TRUE           3
5 1/15/2013           3        2013    7788004    TRUE           3
6 1/14/2014           3        2014    2130731    TRUE           3

But I want scroll through rows and want to return a list of data frames where each data frame consists of matches for that particular row. For some reason this code does not produce the desired output:

  find_dates <- function(file,length){
  data <- alignment(file)
  matches <- list()
  #extract dataset from file and split by aligned dates
  for (i in 1:8){
    #find matching days with corresponding day_of_week and retail_week
    matchday <- data[i,]$DAY_OF_WEEK
    matchweek <- data[i,]$RETAIL_WEEK
    matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek")
  }
  return(matches)
}

But then I get

[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
[1] "1/4/2009"   "1/3/2010"   "1/2/2011"   "1/1/2012"   "12/30/2012" "12/29/2013"

[[5]]
[1] "1/11/2009" "1/10/2010" "1/9/2011"  "1/8/2012"  "1/6/2013"  "1/5/2014" 

[[6]]
[1] "1/18/2009" "1/17/2010" "1/16/2011" "1/15/2012" "1/13/2013" "1/12/2014"

[[7]]
[1] "1/25/2009" "1/24/2010" "1/23/2011" "1/22/2012" "1/20/2013" "1/19/2014"

[[8]]
[1] "2/1/2009"  "1/31/2010" "1/30/2011" "1/29/2012" "1/27/2013" "1/26/2014"

Warning messages:
1: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
2: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
3: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
4: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
5: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
6: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
7: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
8: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length

Can someone tell me where I'm going wrong here? I also don't understand why I'm getting the warnings.

josliber
  • 43,891
  • 12
  • 98
  • 133
Kashif
  • 3,063
  • 6
  • 29
  • 45
  • 1
    Do `matches[[i]] <- fn$sqldf(...)` (note the double brackets to assign elements *in* the list). – MrFlick Jun 05 '15 at 23:20
  • You should have indicated using code that this was with the sqldf-package loaded (which loads the gsubfn package that you are demonstrating – IRTFM Jun 05 '15 at 23:21
  • @MrFlick if what does that mean? Double brackets assign elements in the list but don't single brackets do the same? What's the difference? – Kashif Jun 06 '15 at 01:01
  • @Glassjawed try reading http://adv-r.had.co.nz/Subsetting.html and see if that helps – MrFlick Jun 06 '15 at 03:07
  • The `library(gsubfn)` statement is unnecessary. The sqldf package already pulls it in. – G. Grothendieck Jun 06 '15 at 13:56

1 Answers1

1

Usually if you want to split up a data frame in R into a list of data frames you would use the split function. For instance, if you wanted to split up some sample data based on the retail week and the day of the week, you would use:

# Sample data
data <- read.table(text="     CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
1 1/20/2009           3        2009    2334547   FALSE           1
2 1/19/2010           3        2010    9854269   FALSE           1
3 1/18/2011           3        2011    1951332   FALSE           2
4 1/17/2012           4        2012    8419327    TRUE           2
5 1/15/2013           4        2013    7788004    TRUE           2
6 1/14/2014           4        2014    2130731    TRUE           1", header=TRUE)
spl <- split(data, paste(data$RETAIL_WEEK, data$DAY_OF_WEEK))
# $`3 1`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
# 
# $`3 2`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 3 1/18/2011           3        2011    1951332   FALSE           2
# 
# $`4 1`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 6 1/14/2014           4        2014    2130731    TRUE           1
# 
# $`4 2`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012           4        2012    8419327    TRUE           2
# 5 1/15/2013           4        2013    7788004    TRUE           2

This is a list with the names of the list equal to the retail week followed by a space followed by the day of the week. You can access individual data frames with:

spl[["3 1"]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1

In your question you actually want a list of data frames where each entry corresponds to a row in your original data frame and the data is all rows that have the same retail week and day of the week. This can now be done with simple indexing on your list:

(processed <- unname(spl[paste(data$RETAIL_WEEK, data$DAY_OF_WEEK)]))
# [[1]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
# 
# [[2]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
# 
# [[3]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 3 1/18/2011           3        2011    1951332   FALSE           2
# 
# [[4]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012           4        2012    8419327    TRUE           2
# 5 1/15/2013           4        2013    7788004    TRUE           2
# 
# [[5]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012           4        2012    8419327    TRUE           2
# 5 1/15/2013           4        2013    7788004    TRUE           2
# 
# [[6]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 6 1/14/2014           4        2014    2130731    TRUE           1

As you can see, both the first and second entries have the first two rows, the third has only the third row, etc. You can access data frames for the particular rows with processed[[1]], processed[[2]], ...

Though obviously I can't test this because I don't have your data, I would imagine loading all the data at the beginning into a data frame, splitting, and finally grabbing the appropriate parts would be quicker than doing an individual sql query for each input row.

josliber
  • 43,891
  • 12
  • 98
  • 133