1

In SAS its possible to go through a dataset and used lagged values.

The way I would do it is to use a function that does a "lag", but this presumably would produce a wrong value at the beginning of a chunk. For example if a chunk starts at row 200,000, then it will assume an NA for a lagged value that should come instead from row 199,999.

Is there a solution for this?

grad student
  • 107
  • 7

2 Answers2

0

You're exactly right about the chunking problem. The workaround is to use rxGet and rxSet to pass values between chunks. Here's the function:

lagVar <- function(dataList) { 

     # .rxStartRow returns the overall row number of the first row in this
     # chunk. So - the first row of the first chunk is equal to one.
     # If this is the very first row, there's no previous value to use - so
     # it's just an NA.
     if(.rxStartRow == 1) {

        # Put the NA out front, then shift all the other values down one row.
        # newName is the desired name of the lagged variable, set using
        # transformObjects - see below
        dataList[[newName]] <- c(NA, dataList[[varToLag]][-.rxNumRows]) 

    } else {

        # If this isn't the very first chunk, we have to fetch the previous
        # value from the previous chunk using .rxGet, then shift all other
        # values down one row, just as before.
        dataList[[newName]] <- c(.rxGet("lastValue"),
                                 dataList[[varToLag]][-.rxNumRows]) 

      }

    # Finally, once this chunk is done processing, set its lastValue so that
    # the next chunk can use it.
    .rxSet("lastValue", dataList[[varToLag]][.rxNumRows])

    # Return dataList with the new variable
    dataList

}

and how to use it in rxDataStep:

# Get a sample dataset
xdfPath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")

# Set a path to a temporary file
xdfLagged <- tempfile(fileext = ".xdf")

# Sort the dataset chronologically - otherwise, the lagging will be random.
rxSort(inData = xdfPath,
       outFile = xdfLagged,
       sortByVars = "Date")

# Finally, put the lagging function to use:
rxDataStep(inData = xdfLagged, 
           outFile = xdfLagged,
           transformObjects = list(
               varToLag = "Open", 
               newName = "previousOpen"), 
           transformFunc = lagVar,
           append = "cols",
           overwrite = TRUE)

# Check the results
rxDataStep(xdfLagged, 
           varsToKeep = c("Date", "Open", "previousOpen"),
           numRows = 10)
Nimantha
  • 6,405
  • 6
  • 28
  • 69
Matt Parker
  • 26,709
  • 7
  • 54
  • 72
  • Awesome. One finer point ... It seems unnecessary to write the extra new variable when one can pass .rxSet "just in time"? Also, what does the notation -.rxNumRows do? Does it mean .rxNumRows-1? – grad student Jun 17 '15 at 18:48
  • @APK Sorry, which "extra new" variable do you mean? And - excellent question regarding `.rxNumRows`. That's a special variable (see `?rxTransforms` for related special variables) that returns the number of rows in the current chunk, which I then negate (`-.rxNumRows`) to remove the last element from the variable. A simple example in base R is something like `x <- 1:10; x[-length(x)]`, which drops the last element of `x` without requiring you to know how long `x` is beforehand. Hope that makes sense! I feel like I'm not explaining it well... – Matt Parker Jun 18 '15 at 22:52
  • Basically, `dataList[[varToLag]][.rxNumRows])` returns the very last value in the variable, and `dataList[[varToLag]][-.rxNumRows])` returns all of the values *except* the last one. – Matt Parker Jun 18 '15 at 22:56
  • I think I get what's going on, thanks! This is really cool. Unfortunately, if I may, documentation is thin. – grad student Jun 26 '15 at 00:10
  • @APK You may! It is. :( – Matt Parker Jun 26 '15 at 13:01
  • Someone emailed me to ask how you'd do this if you wanted to lag within multiple groups - here's the approach I'd use: https://gist.github.com/mmparker/8aca803eae5410875a21 – Matt Parker Feb 08 '16 at 18:08
0

Here's another approach for lagging: self-merging using a shifted date. This is dramatically simpler to code and can lag several variables at once. The downsides are that it takes 2-3 times longer to run than my answer using transformFunc, and requires a second copy of the dataset.

# Get a sample dataset
sourcePath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")

# Set up paths for two copies of it
xdfPath <- tempfile(fileext = ".xdf")
xdfPathShifted <- tempfile(fileext = ".xdf")


# Convert "Date" to be Date-classed
rxDataStep(inData = sourcePath,
           outFile = xdfPath,
           transforms = list(Date = as.Date(Date)),
           overwrite = TRUE
)


# Then make the second copy, but shift all the dates up 
# one (or however much you want to lag)
# Use varsToKeep to subset to just the date and 
# the variables you want to lag
rxDataStep(inData = xdfPath,
           outFile = xdfPathShifted,
           varsToKeep = c("Date", "Open", "Close"),
           transforms = list(Date = as.Date(Date) + 1),
           overwrite = TRUE
)

# Create an output XDF (or just overwrite xdfPath)
xdfLagged2 <- tempfile(fileext = ".xdf")

# Use that incremented date to merge variables back on.
# duplicateVarExt will automatically tag variables from the 
# second dataset as "Lagged".
# Note that there's no need to sort manually in this one - 
# rxMerge does it automatically.
rxMerge(inData1 = xdfPath,
        inData2 = xdfPathShifted,
        outFile = xdfLagged2,
        matchVars = "Date",
        type = "left",
        duplicateVarExt = c("", "Lagged")
)
Matt Parker
  • 26,709
  • 7
  • 54
  • 72
  • And it breaks if your dataset has gaps - for example, `Date + 1` on a Friday gives you a Saturday, not a Monday. You could get around that with something like `if(format(Date, "%A") %in% "Friday") { Date + 3 } else { Date + 1}`, but you'd have to add logic like that to account for every holiday, etc. :( – Matt Parker Jan 19 '16 at 17:21