I am trying to generate a lot of test data for other programs. Working in RStudio, I use haven to import an SPSS .sav file that has 73 variables, with their values and labels recorded in it, as a data frame "td". This gives me all the variable names I need to work with. Then I delete all the existing data.

td <- td[0,]

Then I generate 10,000 test data rows by loading the index IDs

td$ID <- 12340000:12349999

So far so good.

I have a constant, ThismanyRows <- 10000, a large vector of column header names in a variable called BinaryVariables, and a vector of valid values for those columns, CheckedOrNot <- c(NA, 1).

This is where the problem is:

td[,BinaryVariables] <- sample(x = CheckedOrNot, size= ThismanyRows, replace = TRUE)

This does fill all the columns with data, but it's exactly the same data in every column, which isn't what I want. I want the sample function to run once against each column, not once against each value in each column.

Wrapping the call in a function does not help either. Defining

Fillbinary <- function () {sample(x = CheckedOrNot, size= ThismanyRows, replace = TRUE)}

and calling

td <- lapply(td[,BinaryVariables],Fillbinary)

generates:

Error in FUN(X[[i]], ...) : unused argument (X[[i]])

So far I have not been able to work out how to deal with each column as a column and apply the sample function to it.

Any help much appreciated.

Peter King
    You are generating 10 values and feeding that in to replace 3 * 10 values. Adjust your sample to `size=ThismanyRows*length(BinaryVariables)` – thelatemail Jul 29 '21 at 04:04

1 Answer

Let's generate some fake data first for the example:

BinaryVariables <- c("v1","v2","v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10

td <- data.frame(ID=1:10)

The issue is that you are generating 10 values and using them to replace 3 * 10 values, so R recycles the same 10 values into every column.
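
A quick way to see this, as a minimal sketch that assumes a plain base R data.frame and reuses the toy objects defined above (`one_draw` is a name introduced only for this illustration):

```r
# Toy setup, matching the example above (plain data.frame assumed)
BinaryVariables <- c("v1", "v2", "v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID = 1:10)

# One draw of 10 values ...
one_draw <- sample(x = CheckedOrNot, size = ThismanyRows, replace = TRUE)

# ... assigned to three columns at once: the 10 values are reused per column
td[BinaryVariables] <- one_draw

# Every column holds the same draw
identical(td$v1, td$v2)  # TRUE
identical(td$v2, td$v3)  # TRUE
```

Whatever the random draw happens to be, the columns are copies of each other, which matches the behaviour described in the question.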

There are a couple of ways to solve this. You might initially think, well, I'll generate 10 values 3 times, like so:

td[BinaryVariables] <- replicate(length(BinaryVariables),
                                 sample(x = CheckedOrNot, size = ThismanyRows, replace = TRUE),
                                 simplify = FALSE)

That will work fine, but why sample 3 times if you can sample once and fill once?

td[BinaryVariables] <- sample(x = CheckedOrNot, 
                              size=ThismanyRows*length(BinaryVariables), replace = TRUE)

And the (well, a) result shows that the values in each column are different:

#   ID v1 v2 v3
#1   1 NA  1  1
#2   2 NA  1  1
#3   3 NA  1 NA
#4   4 NA  1 NA
#5   5  1 NA  1
#6   6 NA  1  1
#7   7  1 NA  1
#8   8  1  1 NA
#9   9  1 NA NA
#10 10  1 NA NA
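
If it helps to see how the single long draw is laid out, the sketch below checks that a plain base R data.frame fills the new columns one after another (the first 10 values go to v1, the next 10 to v2, and so on). `all_values` is a name introduced only for this illustration, and the check assumes `td` is a base data.frame rather than a tibble.

```r
# Toy setup, matching the example above (plain data.frame assumed)
BinaryVariables <- c("v1", "v2", "v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID = 1:10)

# One draw of 10 * 3 values, assigned in a single step
all_values <- sample(x = CheckedOrNot,
                     size = ThismanyRows * length(BinaryVariables),
                     replace = TRUE)
td[BinaryVariables] <- all_values

# The vector is consumed column by column
identical(td$v1, all_values[1:10])
identical(td$v2, all_values[11:20])
identical(td$v3, all_values[21:30])
```

Because each column gets its own slice of the draw, the columns are sampled independently of one another.
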
thelatemail
  • @PeterKing - why would you want to loop over column names to apply a function that doesn't use the column names (hence the error that you have an unused argument)? I think you want something like `replicate(length(BinaryVariables), Fillbinary(), simplify=FALSE)` – thelatemail Jul 30 '21 at 04:11
  • @PeterKing - your question can be reframed as - how do I replicate the results of a function `n` column times to replace `n` columns? I mean, you could do exactly what you ask and `lapply(BinaryVariables, function(x) sample(CheckedOrNot, size= ThismanyRows, replace = TRUE))`, defining a function where the `x` argument never gets used, but that's just confusing. – thelatemail Jul 30 '21 at 04:25
  • Just to note: td[BinaryVariables] <- sample(x = CheckedOrNot, size= ThismanyRows*length(td), replace = TRUE) Returned Error: Assigned data `sample(x = CheckedOrNot, size = ThismanyRows * length(td), replace = TRUE)` must be compatible with existing data. x Existing data has 10000 rows. x Assigned data has 730000 rows. i Only vectors of size 1 are recycled. Run `rlang::last_error()` to see where the error occurred. – Peter King Aug 03 '21 at 04:21
  • The lapply solution works without error, as does the replicate version. Happy to accept either in the body of the answer as the solution, though I think the lapply version is closer to what I was looking for. Thank you. – Peter King Aug 03 '21 at 04:26
  • Works fine for me. Are you working with a tibble or a plain old data.frame? I suspect a tibble since it's returning tidyverse (rlang) errors. – thelatemail Aug 03 '21 at 04:27
  • My library calls are expss, openxlsx, magrittr and dplyr, so perhaps it's been coerced into a tibble somewhere without me doing so explicitly. Not a huge fan of tibbles. – Peter King Aug 03 '21 at 04:30
  • You also used `ThismanyRows * length(td)` instead of `ThismanyRows * length(BinaryVariables)` – thelatemail Aug 03 '21 at 04:30
  • Just copied your code and ran it again. Get the same error, I'm afraid. – Peter King Aug 03 '21 at 04:33
  • @PeterKing - I've just tried in R4.1 and R3.6 in totally fresh sessions in both Windows and Linux and it worked each time. – thelatemail Aug 03 '21 at 04:36
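
Pulling the comment thread together, here is a minimal sketch of the two list-based forms discussed above, using the toy objects from the answer. Both assign a list with one freshly sampled vector per column. The comments report that these forms ran without error in the asker's session, where `td` may have been a tibble; the sketch itself only assumes a plain data.frame, and the argument name `unused` is introduced purely for illustration.

```r
# Toy setup, matching the answer's example (plain data.frame assumed)
BinaryVariables <- c("v1", "v2", "v3")
CheckedOrNot <- c(NA, 1)
ThismanyRows <- 10
td <- data.frame(ID = 1:10)

# Zero-argument helper, as defined in the question
Fillbinary <- function() {
  sample(x = CheckedOrNot, size = ThismanyRows, replace = TRUE)
}

# Option 1: call the helper once per column; nothing is passed to it
td[BinaryVariables] <- replicate(length(BinaryVariables), Fillbinary(),
                                 simplify = FALSE)

# Option 2: lapply() hands each column name to the function, so the
# wrapper must accept (and here ignore) one argument
td[BinaryVariables] <- lapply(BinaryVariables, function(unused) Fillbinary())

str(td)
```

The original `lapply(td[,BinaryVariables], Fillbinary)` failed because `lapply()` always passes each element as the first argument, and `Fillbinary` was defined to take none.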