I've been trying for quite some time to get my test data to split.

> FDF <- read.csv.ffdf(file='C:\\Users\\William\\Desktop\\R Data\\TestData0812.txt', header = FALSE, colClasses=c('factor','factor','numeric','numeric','numeric','numeric'), sep=',')
> names(FDF)<- c('Date','Time','Open','High','Low','Close')
> 
> # ID
> FDF2 <-FDF[1:100,]
> FDF2 <- as.ffdf(FDF2)
> a <- nrow(FDF2)
> # Take section of import for testing
> FDF2[1:3,]
        Date  Time   Open   High    Low  Close
1 1987.08.28 12:00 1.6238 1.6240 1.6237 1.6239
2 1987.08.28 12:01 1.6239 1.6240 1.6235 1.6236
3 1987.08.28 12:02 1.6236 1.6239 1.6235 1.6238
> 
> ID <- data.frame(matrix(1:a, nrow = a, ncol=1 ))
> ID <- as.ffdf(ID)
> names(ID) <- c('ID')
> FDF3 <- cbind.ffdf2(ID, FDF2)
> # Create ID column and binds together
> FDF3[1:3,]
  ID       Date  Time   Open   High    Low  Close
1  1 1987.08.28 12:00 1.6238 1.6240 1.6237 1.6239
2  2 1987.08.28 12:01 1.6239 1.6240 1.6235 1.6236
3  3 1987.08.28 12:02 1.6236 1.6239 1.6235 1.6238

The full file is 700 MB, which is why I am using an ffdf object. How can I split the dataset?

My current code is:

T = ffdfdply(FDF3, split(FDF3$ID, rep(1:10,each=10)))

I have tried quite a few variations of this and searched this forum and others. For simplicity, I've included only the example above.

Running the code above gives me the following error:

Error in ffdfdply(FDF3, split(FDF3$ID, rep(1:10, each = 10))) :
split needs to be the same length as the number of rows in x

I can't understand why a split of rep(1:10, each = 10) does not work on a data set with these dimensions: > dim(FDF3) [1] 100 7

I would also like the split to work when the rows do not divide evenly into groups, for example: T = ffdfdply(FDF3, split(FDF3$ID, rep(1:10,each=3)))

I've been on this for at least 20 hours.

  • The first argument to `split(...)` has to be a data frame, you are passing a vector (FDF3$ID). The "number of rows" in a vector (e.g., nrows(FDF3$ID)) is `NULL`. Try `split(FDF3,rep(1:10,each=10))`. – jlhoward Dec 08 '14 at 17:34
  • @jlhoward - thank you for the reply. I have tried that. However, I've given it a run again; `T = ffdfdply(FDF3, split(FDF3, rep(1:10,each=10)))` and received `Error in ffdfdply(FDF3, split(FDF3, rep(1:10, each = 10))) : split needs to be the same length as the number of rows in x In addition: Warning message: In split.default(FDF3, rep(1:10, each = 10)) : data length is not a multiple of split variable` Any ideas? –  Dec 08 '14 at 21:22
  • What do you expect `T` to be exactly? A list of `ffdf` objects? A list of data frames? `ffdfdply(...)` is a "split-apply-combine" function, so it requires a `FUN=...` argument (see the documentation). – jlhoward Dec 08 '14 at 22:26
  • @jlhoward - I had expected `T` (test) to be separate ffdf objects, accessible in a similar way to the output of the split function; for ffdf `1 = T$'1', 2 = T$'2'` and so on. I read the documentation, but never realized that the three operations were co-dependent, i.e. that you couldn't use split without an apply or combine argument. Would I be correct in assuming this? If so, I think it may be more efficient to create a subset loop. Would you have any thoughts about the suitability of my method for my task? Thanks again. –  Dec 09 '14 at 14:18

1 Answer

I couldn't figure out the correct usage of ffdfdply, and I still don't know whether it would have been the right tool for this. However, I have constructed a workaround and hope someone finds it useful. It is admittedly ugly, so I'm open to suggestions on how to simplify it and would appreciate your comments.

ffdfSplitSize <- 5
# Number of rows per chunk
ffdfrows <- nrow(FDF3)
# Total number of rows to split

splitNum <- ceiling(ffdfrows / ffdfSplitSize)
# Number of chunks required; the last chunk may be shorter
ffdf.names <- paste('FFDF', ffdfSplitSize, seq_len(splitNum), sep = '.')
# Names for the resulting tables, e.g. FFDF.5.1, FFDF.5.2, ...

for (i in seq_len(splitNum)) {
        ffdfStart <- (i - 1) * ffdfSplitSize + 1
        ffdfEnd <- min(i * ffdfSplitSize, ffdfrows)
        assign(ffdf.names[i], as.ffdf(FDF3[ffdfStart:ffdfEnd, ]))
}
# Loops until every row has been assigned to exactly one chunk
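
As a rough illustration of the lengths involved, here is a small base-R sketch. It uses a plain data frame as a stand-in for FDF3 (an assumption; the names df, chunk_size, chunk_id and chunks are made up for this example, and nothing here touches ff or ffbase). It shows that split() returns a list with one element per group, which would explain the "same length as the number of rows" complaint, and that a per-row chunk label handles a chunk size that does not divide the row count evenly.

```r
# Plain data frame standing in for FDF3 (assumption: 100 rows, ID = 1..100)
n <- 100
df <- data.frame(ID = seq_len(n), Close = seq(1.62, 1.63, length.out = n))

# split() returns a list with one element per group, so its length is the
# number of groups (10), not the number of rows (100)
groups_list <- split(df$ID, rep(1:10, each = 10))
length(groups_list)

# One chunk label per row; with a chunk size of 3 the last chunk is shorter
chunk_size <- 3
chunk_id <- ceiling(seq_len(n) / chunk_size)
length(chunk_id) == nrow(df)

# Splitting the data frame by that label gives a named list of chunks,
# accessible as chunks$`1`, chunks$`2`, and so on
chunks <- split(df, chunk_id)
length(chunks)
nrow(chunks[[length(chunks)]])
```

For an ffdf, the same chunk_id arithmetic could be used to work out the row ranges for a loop like the one above; whether split() can be applied to an ffdf directly is not something this sketch assumes.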