
I have two separate tab-delimited data sets, datA and datB.

datA looks like this, with the entries in the source column running to 10 million rows:

source    Bin1    Bin2    Bin3    Bin4    Bin5
  A         1       1       2       2       3
  B         1       1       1       1       1
  C         0       0       0       1       0
  D         0       0       2       0       0
  E         4       0       0       1       0
  F         1       0       1       2       1
  G         0       5       0       0       0

datB looks like this, with up to 70 rows:

Bins    readcounts
Bin1         100 
Bin2         40
Bin3         200
Bin4         150
Bin5         320
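
For reference, tab-delimited files like these can be read with `read.delim`; the sketch below uses the `text` argument with an inline copy of datB standing in for the real file path, which is not given in the question:

```r
# Sketch: inline text stands in for the real file; in practice it would be
# something like datB <- read.delim("datB.txt") with the actual path.
# read.delim assumes tab-separated columns with a header row.
datB <- read.delim(text = "Bins\treadcounts\nBin1\t100\nBin2\t40\nBin3\t200\nBin4\t150\nBin5\t320")
```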

I would like to do the following in R:

  1. Randomly pick n samples from datA (i.e. one sample, then two samples, then three samples, and so on up to N). I also want to bootstrap each step (repeat the random picking, say, 100 times).

  2. Calculate the sum of unique entries for the randomly picked samples (e.g. Bin1 for one sample, Bin3 and Bin4 for two samples, etc.), such that:

    • in the one-sample case (Bin1), all entries are unique;
    • in the two-sample case (Bin3 and Bin4), the unique entries are those not shared by the two randomly picked samples (1 + 2 = 3).
  3. Finally, I want to get the sum of read counts in datB for the randomly picked samples from datA.

The results would look like this:

RandomlyPicked_samples   Sum_uniq_entries   Total_readcounts
1                        7                  100
2                        3                  350
...

However, I also want to get the mean and standard deviation for the bootstrap at each randomized step.
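
A rough sketch of the whole procedure, rebuilding datA and datB from the small examples above; `n_boot` is a made-up name for the number of bootstrap replicates, and "unique" is read here as "nonzero in exactly one of the picked bins":

```r
# Sketch only: datA/datB are the small example tables from the question;
# n_boot is an assumed parameter name.
datA <- data.frame(
  source = LETTERS[1:7],
  Bin1 = c(1, 1, 0, 0, 4, 1, 0),
  Bin2 = c(1, 1, 0, 0, 0, 0, 5),
  Bin3 = c(2, 1, 0, 2, 0, 1, 0),
  Bin4 = c(2, 1, 1, 0, 1, 2, 0),
  Bin5 = c(3, 1, 0, 0, 0, 1, 0)
)
datB <- data.frame(
  Bins = paste0("Bin", 1:5),
  readcounts = c(100, 40, 200, 150, 320)
)

set.seed(1)
n_boot <- 100
bins <- setdiff(colnames(datA), "source")

results <- do.call(rbind, lapply(seq_along(bins), function(n) {
  # one bootstrap replicate: pick n bins, sum unique entries and read counts
  reps <- replicate(n_boot, {
    picked <- sample(bins, n)                # step 1: pick n bins at random
    sub <- datA[, picked, drop = FALSE]
    uniq <- rowSums(sub > 0) == 1            # nonzero in exactly one picked bin
    c(uniq_sum  = sum(sub[uniq, , drop = FALSE]),                  # step 2
      readcount = sum(datB$readcounts[datB$Bins %in% picked]))     # step 3
  })
  # mean and SD over the bootstrap replicates at this step
  data.frame(
    RandomlyPicked_samples = n,
    Mean_uniq_entries = mean(reps["uniq_sum", ]),
    SD_uniq_entries   = sd(reps["uniq_sum", ]),
    Mean_readcounts   = mean(reps["readcount", ]),
    SD_readcounts     = sd(reps["readcount", ])
  )
}))
results
```

When all five bins are picked the replicates are identical (every draw is a permutation of the same columns), so the SDs in the last row are zero; with fewer bins the mean/SD summarize the variation across the 100 random picks.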

I can already get the per-column sums of shared and unique entries (each column compared against all columns) using these commands in R:

df <- data.frame(
    "source" = LETTERS[1:7],
    "Bin1" = c(1,1,0,0,4,1,0),
    "Bin2" = c(1,1,0,0,0,0,5),
    "Bin3" = c(2,1,0,2,0,1,0),
    "Bin4" = c(2,1,1,0,1,2,0),
    "Bin5" = c(3,1,0,0,0,1,0)
)

colsum_of_shared_entries <- colSums(df[which(apply(df[,-1], 1, function(x) all(x > 0))), -1])

colsum_of_shared_entries
# Bin1 Bin2 Bin3 Bin4 Bin5 
#  2    2    3    3    4

sum_of_unique_counts <- colSums(df[which(lapply(apply(df[,-1], 1, function(x) which(x > 0)), length) == 1),-1])

sum_of_unique_counts
# Bin1 Bin2 Bin3 Bin4 Bin5 
#   0    5    2    1    0 
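
As an aside, both calculations above can be written with `rowSums` instead of `apply`, which should be noticeably faster at 10 million rows; this is only a restatement of the two commands, not a change in behaviour:

```r
# Same example df as in the question
df <- data.frame(
  source = LETTERS[1:7],
  Bin1 = c(1, 1, 0, 0, 4, 1, 0),
  Bin2 = c(1, 1, 0, 0, 0, 0, 5),
  Bin3 = c(2, 1, 0, 2, 0, 1, 0),
  Bin4 = c(2, 1, 1, 0, 1, 2, 0),
  Bin5 = c(3, 1, 0, 0, 0, 1, 0)
)

nonzero <- rowSums(df[, -1] > 0)   # number of bins each row appears in
colsum_of_shared_entries <- colSums(df[nonzero == ncol(df) - 1, -1])  # in every bin
sum_of_unique_counts     <- colSums(df[nonzero == 1, -1])             # in exactly one bin
```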

Any help would be much appreciated.

  • Any additional information should be added to the question. I've tried to do it for you, but please check. – Borodin Apr 03 '17 at 12:44
  • You originally had a Perl tag on this question but it has been removed. Why was it there? Do you want a Perl solution? – Borodin Apr 03 '17 at 12:45
  • *" I want to bootstrap the process as well"* What does this mean? – Borodin Apr 03 '17 at 12:47
  • Thanks Borodin, maybe the term isn't "bootstrap", but essentially I want to do the random picking like 100 times, and then calculate the mean (plus SD) for all randomly picked samples at each step (i.e. 1 sample, 2 samples, 3 samples, … n samples) – Daudi Apr 03 '17 at 12:52
  • I put "perl" since I thought there might also be an easier solution using Perl programming. – Daudi Apr 03 '17 at 12:56
  • When you say you want to "sample" do you mean selecting a subset of rows? I don't get how the Bin1 - Bin5 relate to your concept of sampling (i.e what does "In the two-sample-case (Bin3 and Bin4) the unique entries are those not shared by the two randomly picked samples (1 + 2 = 3)." mean?) – TARehman Apr 03 '17 at 14:48
  • Thanks TARehman. I want to randomly select a subset of columns, so that I can calculate the sum of unique entries (in the rows) for those randomly selected columns, disregarding the other columns. I also want to repeat each random selection so that I can calculate the mean and SD for each subset of columns: one column, two columns, three columns, … up to 70 columns. – Daudi Apr 04 '17 at 05:33
