Is there a faster way to apply logical operations to subset a large dataset in R?

Question

First post on StackOverflow, so be gentle if I don't get the etiquette quite right.

I have a big data frame (well, seven of them actually, but that isn't important) containing hands drawn from a deck of cards. I have another array that goes with it, showing which cards out of the initial hand a player chose to hold. Any cards that were not held are re-drawn from the deck. The first data frame holds all the drawn cards, so each row can be anywhere between 5 and 10 columns long, for between 5 and 0 cards held. Does that make sense? For example:

> str(cards01)
'data.frame':   5044033 obs. of  10 variables

> head(cards01)
   V1  V2  V3  V4  V5  V6  V7 V8  structure(c("", "", "", "", "", ""), class = "AsIs")
1  D0 D10  H0  C5  H1  S3  C4 D6                                                      
2  D5 S10  H7  C7  S0  S5 S12 H5                                                      
3  S4  H4  C1  D4 D11  H6  D1                                                         
4  C3  C9  D9 S10  S2  C7  S3 D2                                                      
5 H11  C0  C6  H3 H12 C11  S0                                                         
6 C10  C9 D11  D8  D5  S8

> str(heldCards01)
 num [1:5044033, 1:5] 1 3 1 2 1 1 2 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:5] "1" "2" "3" "4" ...

> head(heldCards01)
     1 2  3  4  5
[1,] 1 3 NA NA NA
[2,] 3 4 NA NA NA
[3,] 1 2  4 NA NA
[4,] 2 3 NA NA NA
[5,] 1 4  5 NA NA
[6,] 1 2  3  4 NA

So what I'm doing is making a new data frame that contains just the cards the player ends up with, i.e. removing the cells in each row that aren't identified in the held cards array. I've written code to do this, but it has now been running all weekend and still hasn't finished. This is the code I'm running. It all happens inside an lapply that goes through each of the data frame/matrix pairs I have; the bit I'm trying to optimize is inside the mclapply:

all.hands <- lapply(stakes, function(stake){
  cardsOb <- get(paste("cards", stake, sep = ""))
  heldOb <- get(paste("heldCards", stake, sep = ""))
  l <- length(cardsOb[,1])
  mclapply(1:l, function(rowNum) {
    row <- (heldOb[rowNum,])
    theNAs <- as.logical(is.na(row))
    heldIndex <- row[!theNAs]
    discarded <- c(1,2,3,4,5)[-heldIndex]
    if(length(discarded) >= 1) {
      hand <- cardsOb[rowNum,-discarded]
    } else {
      hand <- cardsOb[rowNum,]
    }
    hand <- sort(hand)
  })
})

Are there any functions I'm missing that could cut out some steps? Would it be faster if the data frame was an array? Do I just have to wait for days & days? I'm running on a Z620 with two Xeon E5-2407 quad core processors and 32GB memory, if that matters.

Why do you have a `structure(..)` in `head(.)`? Could you edit it out? It's better to paste `dput(head(.))`. – Arun Jun 24 '13 at 11:30
And yes, it's hard to follow what you're trying to accomplish (at least to me). Your code uses `stakes`, which you've not provided. – Arun Jun 24 '13 at 11:37
Debug your code with a small deck (e.g. only A thru J of two suits) so you can get some results & see what's going on. Next, how about 'rotating' your data so each hand is a column? That way you can create an N-row by 10-column matrix full of `NA` and write the card values as they show up. Typically this is a lot faster than building up a data.frame inside loops. – Carl Witthoft Jun 24 '13 at 12:15
Arun - I don't know why the structure bit is there, must be some metadata resulting from the way I read in the data, I don't think it's important. As I said, I have a number of these data frames, they are all named in the form cardsx where x is the stake placed `[1] "025" "1" "10" "2" "20" "30" "40" "5" "50"`. The first lapply is just there to apply the later mclapply to each of these data frames. As I said, that bit isn't important. – Bill Beesley Jun 25 '13 at 09:48
Carl - I've tried using a smaller data set and the script works. These are real player logs, so I can't really break it down by card or suit, but I can just use a smaller number of plays. Could you give me a bit more detail on the "write the card values as they show up" method you mentioned? I'm not really sure what you mean. – Bill Beesley Jun 25 '13 at 09:51

score 0 · Answer 1 · answered Jun 24 '13 at 11:57


Here is how I'd do it. For simplicity I assume your initial card holdings are in data frame df1 and the held card indices are in df2 (just changed names).

The idea is to use the rows of df2 as indices into the matching rows of df1, and repeat for all rows. To avoid class issues, I work with arrays rather than data.frames (which are not very good as indices).

This can be done in one "geekish" command:

holdings = t(sapply(1:nrow(df1),function(x) as.matrix(df1)[x,][as.matrix(df2)[x,]]))

You can then change the row- and colnames, build a new data.frame, etc.

There are probably nicer ways to do that, but I think the above is quite simple. Feel free to ask if you don't understand something in that command.
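One possible variation, sketched here with made-up toy data standing in for df1 and df2: convert both objects to matrices once, then index the card matrix with a two-column (row, column) matrix so that no per-row function call is needed. The names `cards`, `held` and `idx` are illustrative only, and this has not been timed on the 5-million-row data from the question.

```r
# Toy stand-ins for df1 (dealt cards) and df2 (held positions); values are made up.
df1 <- data.frame(V1 = c("D0", "D5", "S4"),
                  V2 = c("D10", "S10", "H4"),
                  V3 = c("H0", "H7", "C1"),
                  V4 = c("C5", "C7", "D4"),
                  V5 = c("H1", "S0", "D11"),
                  stringsAsFactors = FALSE)
df2 <- rbind(c(1, 3, NA, NA, NA),
             c(3, 4, NA, NA, NA),
             c(1, 2, 4,  NA, NA))

cards <- as.matrix(df1)   # convert once, not once per row
held  <- as.matrix(df2)

# Pair every held position with its row number. Indexing a matrix with a
# two-column (row, column) matrix returns one element per pair, and a pair
# containing NA yields NA in the result.
idx <- cbind(as.vector(row(held)), as.vector(held))
holdings <- matrix(cards[idx], nrow = nrow(held))

holdings[1, ]   # held cards for the first hand, padded with NA
```

The result has the same shape as the `sapply` version (one row per hand, five columns, `NA` where fewer than five cards were held), but the lookup happens in a single vectorised call.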

amit

That's clean and fast, but it only gets me the held cards, not the final hand. The final hand is the held cards + the new cards dealt in the second draw. I'm currently looking at using your method with `rev()` and `head()`, but I think removing the NAs and empty strings is going to be the time-consuming bit. Do you know a way of using `scan()` to read the vectors into a list so they can all be different lengths? – Bill Beesley Jun 25 '13 at 11:36
@BillBeesley if each vector is a separate file, then loop over file names and `cardlist[[jj]] <- scan(file.jj)` (pseudocode) will build your list for you. – Carl Witthoft Jun 25 '13 at 13:05
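
On the final-hand follow-up above, here is a minimal sketch of one way the held cards and the redrawn cards might be combined without a per-row loop. It rests on assumptions not confirmed in the thread: that the redrawn cards sit in columns 6 to 10, that unused cells are empty strings, and that every final hand has exactly five cards. The toy values are copied from the `head(cards01)` output in the question, and the names `kept`, `redrawn`, `both` and `final` are illustrative only.

```r
# Toy data: first three hands from the question, padded with "" to 10 columns.
cards <- rbind(
  c("D0", "D10", "H0", "C5", "H1",  "S3", "C4",  "D6", "", ""),
  c("D5", "S10", "H7", "C7", "S0",  "S5", "S12", "H5", "", ""),
  c("S4", "H4",  "C1", "D4", "D11", "H6", "D1",  "",   "", ""))
held <- rbind(c(1, 3, NA, NA, NA),
              c(3, 4, NA, NA, NA),
              c(1, 2, 4,  NA, NA))

# Held cards via (row, column) matrix indexing; NA positions stay NA.
kept <- matrix(cards[cbind(as.vector(row(held)), as.vector(held))],
               nrow = nrow(held))

# Assumed layout: redrawn cards are in columns 6:10, blank cells are "".
redrawn <- cards[, 6:10, drop = FALSE]
redrawn[which(redrawn == "")] <- NA

# Transpose so each hand is a column; dropping NAs then reads hand by hand.
both <- t(cbind(kept, redrawn))
stopifnot(sum(!is.na(both)) == 5 * nrow(cards))  # guard for the 5-card assumption
final <- matrix(both[!is.na(both)], ncol = 5, byrow = TRUE)

final[1, ]   # held cards first, then the redrawn cards
```

Because every hand ends up with five cards under these assumptions, the result fits in a plain character matrix, so no list of variable-length vectors is needed. Sorting each hand, as the original code does, would still need a separate step.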
