0

can somebody help with debugging a function. It is meant to do

dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)

myfunc(dat3, chunksize = 8)
##  [1] 4 7 5 8 4 4 4 4   4 7 5 8 4 4 5 5   4

partition the data in chunks of a sizer and make sure that every level is present in every chunk. The function works for the toy example

myfunc <- function(x, chunksize = 8) {
    numChunks <- ceiling(length(x) / chunksize)
    uniqx <- unique(x)
    lastChunkSize <- chunksize * (1 - numChunks) + length(x)
    ## check to see if it is mathematically possible
    if (length(uniqx) > chunksize)
        stop('more factors than can fit in one chunk')
    if (any(table(x) < numChunks))
        stop('not enough of at least one factor to cover all chunks')
    if (lastChunkSize < length(uniqx))
        stop('last chunk will not have all factors')
    ## actually arrange things in one feasible permutation
    allIndices <- sapply(uniqx, function(z) which(z == x))
    ## fill one of each unique x into chunks
    chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
    remainder <- unlist(sapply(allIndices, tail, n = -3))
    remainderCut <- split(remainder, ceiling(seq_along(remainder)/4))
    ## combine them all together, wary of empty lists
    finalIndices <- sapply(1:numChunks,
           function(i) {
               if (i <= length(remainderCut))
                   c(chunks[[i]], remainderCut[[i]])
               else
                   chunks[[i]]
           })
           save(finalIndices,file="finalIndices")
    x[unlist(finalIndices)]

}

the problem is that I want to get the rearranged indixes from the function (so what is called here final Indices). The problem is that for my real data set with more observations (https://www.dropbox.com/s/n3wc5qxaoavr4ta/j.RData?dl=0), the function does not work.

The data as factor https://www.dropbox.com/s/0ue2xzv5e6h858q/t.RData?dl=0

I change the chunkszie paramter according to the number of levels present to 9847 I in the first line of the function). The problem is that when I access finalIndices from the saved file, I get a matrix with dim 137 60. Which does not provide an index for all my observations (nearly 600k). Could somebody tell me what am i doing wrong? I know that 60 is the number of chunks (nrows/chunksize) but 137 appear not to fit.

wery
  • 1
  • 3
  • I think you're working awfully hard. But first thing: why do you want your data to be `factors`, and the sample you provided is **not** factors. If you want all values to appear in all chunks, I'd recommend first doing a sort on the input and then distributing each unique value across your output "chunks" . – Carl Witthoft Jan 11 '15 at 15:27
  • so let's back up a bit. What is the actual problem at hand? I'm willing to bet an internet or two that there's a much more straightforward way to do the sifting/sorting that you're aiming for. So can you explain what the context of this problem is? – Carl Witthoft Jan 11 '15 at 15:29
  • I also updated the question and included the factor as t.RData – wery Jan 11 '15 at 16:05
  • @Carl the data are read into memory chunkwise by a statistical model which requires that each chunk has all levels of the factor. That is also why i thought about this function. – wery Jan 11 '15 at 16:28
  • Sounds like an ill-formed statistical model then, as re-grouping by factor can change the apparent distribution. – Carl Witthoft Jan 11 '15 at 17:59
  • it is just reordering the observations...not grouping by factr but just trying that ever level of every factor e is present in every chunk – wery Jan 11 '15 at 19:47

1 Answers1

1

The line remainderCut <- split(remainder, ceiling(seq_along(remainder)/4)) is hard-coded to the toy data set and just adds four elements to each chunk, which produces wrong results for other data sets.

Whereas this problem can be fixed by modifying your code, I have come up with a slightly different approach to this problem:

library(data.table)

generate.chunks <- function(dat3, chunksize = 8) {
    # get number of unique values
    freqs <- table(dat3)

    # get chunk sizes
    chunk.sizes <- rep(chunksize,length(dat3) %/% chunksize)    
    last.chunk.size <-  length(dat3) %% chunksize
    if (last.chunk.size > 0) chunk.sizes <- c(chunk.sizes,last.chunk.size)

    # few checks
    if (chunksize < length(freqs)) 
        stop(sprintf('Chunk size is smaller than the number of factors: %i elements in a chunk, %i factors. Increase the chunk size',chunksize,length(freqs)))
    if (chunk.sizes[length(chunk.sizes)] < length(freqs)) 
        stop(sprintf('Last chunk size is smaller than the number of factors: %i elements in the chunk, %i factors. Use a different chunk size',chunksize,length(freqs)))
    if (min(freqs) < length(chunk.sizes))
        stop(sprintf('Not enough values in a factor to populate every chunk: %i < %i. Increase the chunk size',min(freqs),length(chunk.sizes)))

    # make sure that each chunk has at least one factor
    d.predefined <- data.frame(
            chunk = rep(1:length(chunk.sizes),each=length(freqs)),
            i     = rep(1:length(freqs),length(chunk.sizes))
    )

    # randomly distribute the remaining values
    d.sampled <- data.frame(
        chunk = unlist(mapply(rep,1:length(chunk.sizes),chunk.sizes - length(freqs),SIMPLIFY=F)),
        i     = sample(unlist(mapply(rep,1:length(freqs),freqs - length(chunk.sizes))))
    )

    # put the predefined and sampled results together and split
    d.result <- rbind(d.predefined,d.sampled)

    # calculate indices
    indices <- sapply(names(freqs),function(s) which(dat3==s))
    dt <- as.data.table(d.result)
    dt[,ind:=indices[[i]],by=i]
    finalIndices <- split(dt$ind,dt$chunk)
    save(finalIndices,file="finalIndices")

    names(freqs)[d.result$i]
}
Marat Talipov
  • 13,064
  • 5
  • 34
  • 53
  • oh okay..but could you tell me w how i cn obtain the indices of how to re arrange the elements from your approach? It woudl be got to get the indices to use them for th e complete table instead of a single column ... – wery Jan 11 '15 at 23:53
  • oh, you do need them... I've modified the code to calculate them. – Marat Talipov Jan 12 '15 at 00:47