I'm running t-tests against many grouping variables (markers), each of which has only two groups (0 or 1). The complete data set has a million grouping variables, e.g. n_obs = 1e+06, n_vals = 300, with 5% NA.

> n_obs = 1e+04 # to simulate grouping matrix
> n_vals = 100
> g = matrix(sample(0:1, n_obs * n_vals, replace=TRUE), n_obs, n_vals)
> row.names(g) = paste("marker", 1:nrow(g), sep="")
> colnames (g) = paste("country", 1:ncol(g), sep="")
> g[1:2,1:5]
    country1 country2 country3 country4 country5
marker1        1        1        1        1        0
marker2        1        0        0        0        0

> vals = rnorm (n_vals) ; names(vals) = colnames(g) # to simulate values
> head(vals)
  country1   country2   country3   country4   country5   country6 
-0.4048584  0.2792725  0.4064460  0.9002677  0.2187961  0.2141666 

> res = apply(g, 1, function(x) t.test(vals~ x)) ## applying the t-tests. Quite slow.

> library(broom) # provides tidy()
> tres = do.call(rbind, lapply(res, tidy)) ## tidying the t-tests. Very slow :(
> head(tres)
       estimate   estimate1   estimate2   statistic   p.value parameter   conf.low
marker1 -0.03560203 -0.07373907 -0.03813704 -0.17495425 0.8615063  90.52404 -0.4398452
marker2  0.27284988  0.07194537 -0.20090451  1.33127950 0.1863794  92.20240 -0.1341928

Because tidy is so slow on larger data sets, I was thinking of computing the t-test in separate parts, looping through 'g' row by row to generate each component of the t-test.

I can 'split' the values for the first marker, and then get the means for each group:

> mysplit = split( vals, g[1,])
> lapply(mysplit, mean)
$`0`
[1] -0.07373907
$`1`
[1] -0.03813704

How can I 'loop' through all of the rows of 'g', getting the sums of 'vals' for each group, then the standard deviation etc.?

I'm trying to keep functions simple for speed.
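
A minimal sketch of one way to get every component without a loop, assuming 'vals' itself has no NA, that NA entries in 'g' should simply be dropped for that marker, and that each marker has at least two observations in both groups. The group counts, sums and sums of squares for all rows come from matrix products, and the Welch statistic (the default for t.test) is built from them. The helper name fast_welch is hypothetical, not from any package.

```r
# fast_welch() is a hypothetical helper name, not part of any package.
# g:    0/1 matrix (markers in rows, samples in columns), may contain NA
# vals: numeric vector with one value per column of g, assumed NA-free
fast_welch <- function(g, vals) {
  # Indicator matrices; an NA in g belongs to neither group
  in0 <- 1 * (!is.na(g) & g == 0)
  in1 <- 1 * (!is.na(g) & g == 1)

  # Group sizes, means and variances for every marker at once
  n0 <- rowSums(in0)
  n1 <- rowSums(in1)
  mean0 <- drop(in0 %*% vals) / n0
  mean1 <- drop(in1 %*% vals) / n1
  var0 <- (drop(in0 %*% vals^2) - n0 * mean0^2) / (n0 - 1)
  var1 <- (drop(in1 %*% vals^2) - n1 * mean1^2) / (n1 - 1)

  # Welch t statistic and Welch-Satterthwaite degrees of freedom
  se0 <- var0 / n0
  se1 <- var1 / n1
  statistic <- (mean0 - mean1) / sqrt(se0 + se1)
  parameter <- (se0 + se1)^2 / (se0^2 / (n0 - 1) + se1^2 / (n1 - 1))

  data.frame(estimate  = mean0 - mean1,
             estimate1 = mean0,
             estimate2 = mean1,
             statistic = statistic,
             parameter = parameter,
             p.value   = 2 * pt(-abs(statistic), parameter))
}

# Small simulated example in the same shape as the data above
set.seed(1)  # arbitrary seed, only for reproducibility
n_obs  <- 1000
n_vals <- 100
g <- matrix(sample(0:1, n_obs * n_vals, replace = TRUE), n_obs, n_vals)
rownames(g) <- paste0("marker", seq_len(n_obs))
vals <- rnorm(n_vals)

fast <- fast_welch(g, vals)
head(fast)
```

The indicator matrices are as large as 'g' stored as doubles, so for the full data it may be necessary to call the helper on blocks of rows and rbind the results. The sum-of-squares formula for the variance can lose precision when the values are large relative to their spread; centring 'vals' first would be one way to reduce that.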

Sarah
  • What is this tidy function? You could speed up the t.test apply using parallel. But it looks like tidy is your own function so maybe that can be optimised. – JeremyS Feb 23 '16 at 03:52
  • The tidy function is from the broom package https://cran.r-project.org/web/packages/broom/vignettes/broom.html – Sarah Feb 23 '16 at 09:54
  • I could've added actually that I am using the parallel package. It speeds up the t-test but not the tidy. – Sarah Feb 23 '16 at 09:55
  • I don't know why tidy takes so long, but all the information you want should already be in the list of lists output from the t.test; you just need to access it. For example, you can access the p-value using `unlist(lapply(res,function(x) x$p.value))` and add that vector to a data.frame – JeremyS Feb 25 '16 at 03:49
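
A short sketch of the extraction suggested in the last comment, assuming 'res' is the list of t.test results built with apply as in the question. It reads each component straight out of the htest objects with vapply, so tidy is not needed. The small simulated data here is only for illustration.

```r
# Small simulated data in the same shape as in the question
set.seed(1)  # arbitrary seed, only for reproducibility
g <- matrix(sample(0:1, 20 * 30, replace = TRUE), 20, 30)
rownames(g) <- paste0("marker", 1:20)
vals <- rnorm(30)

# List of htest objects, one per marker (row of g)
res <- apply(g, 1, function(x) t.test(vals ~ x))

# Pull each component directly from the htest list elements
tres <- data.frame(
  estimate1 = vapply(res, function(x) x$estimate[[1]],  numeric(1)),
  estimate2 = vapply(res, function(x) x$estimate[[2]],  numeric(1)),
  statistic = vapply(res, function(x) x$statistic[[1]], numeric(1)),
  parameter = vapply(res, function(x) x$parameter[[1]], numeric(1)),
  p.value   = vapply(res, function(x) x$p.value,        numeric(1)),
  conf.low  = vapply(res, function(x) x$conf.int[1],    numeric(1)),
  conf.high = vapply(res, function(x) x$conf.int[2],    numeric(1))
)
head(tres)
```

Each vapply call returns one plain numeric vector, so the data.frame is built once rather than by binding a million one-row data frames together, which is likely where much of the time goes in the do.call(rbind, ...) version.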

0 Answers