I have two separate tab-delimited data sets, datA and datB.
datA looks like this, with the entries in the source column going all the way to 10 million rows:
source Bin1 Bin2 Bin3 Bin4 Bin5
A 1 1 2 2 3
B 1 1 1 1 1
C 0 0 0 1 0
D 0 0 2 0 0
E 4 0 0 1 0
F 1 0 1 2 1
G 0 5 0 0 0
datB looks like this, with up to 70 rows:
Bins readcounts
Bin1 100
Bin2 40
Bin3 200
Bin4 150
Bin5 320
I would like to do the following in R:
- Randomly pick N samples from datA (i.e. one sample, then two samples, then three samples, etc., up to N). I want to bootstrap the process as well.
- Calculate the sum of unique entries for the randomly picked samples, e.g. in Bin1 (for one sample), Bin3 and Bin4 (for two samples), etc., such that:
  - In the one-sample case (Bin1), all entries would be unique.
  - In the two-sample case (Bin3 and Bin4), the unique entries are those not shared by the two randomly picked samples (1 + 2 = 3).
- Finally, I want to get the sum of read counts in datB related to the randomly picked samples in datA.
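One possible reading of these steps, as a minimal sketch rather than a definitive answer: treat each Bin column as a "sample", call an entry unique when its row is non-zero in exactly one of the picked columns, and look up the read counts by bin name. The helper name `summarise_pick` and this definition of uniqueness are assumptions. On the 7-row example this reading gives 7 for Bin1 but 4 for Bin3 and Bin4 (rows C, D and E), not the 3 quoted in the example, so the definition may need adjusting.

```r
# Toy versions of datA and datB from the question
datA <- data.frame(
  source = LETTERS[1:7],
  Bin1 = c(1, 1, 0, 0, 4, 1, 0),
  Bin2 = c(1, 1, 0, 0, 0, 0, 5),
  Bin3 = c(2, 1, 0, 2, 0, 1, 0),
  Bin4 = c(2, 1, 1, 0, 1, 2, 0),
  Bin5 = c(3, 1, 0, 0, 0, 1, 0)
)
datB <- data.frame(
  Bins = paste0("Bin", 1:5),
  readcounts = c(100, 40, 200, 150, 320)
)

# Hypothetical helper: summarise one set of picked bins.
# ASSUMPTION: an entry is "unique" if its row is non-zero in exactly
# one of the picked columns.
summarise_pick <- function(picked, counts, reads) {
  m <- as.matrix(counts[, picked, drop = FALSE])
  uniq_rows <- rowSums(m > 0) == 1
  c(
    Sum_uniq_entries = sum(m[uniq_rows, , drop = FALSE]),
    Total_readcounts = sum(reads$readcounts[match(picked, reads$Bins)])
  )
}

summarise_pick("Bin1", datA, datB)
summarise_pick(c("Bin3", "Bin4"), datA, datB)
```

Keeping the per-draw logic in one function means the uniqueness rule can be swapped without touching any sampling code around it.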
The results would look like this:
RandomlyPicked_samples Sum_uniq_entries Total_readcounts
1 7 100
2 3 350
..
However, I also want to get the mean and standard deviation for the bootstrap at each randomized step.
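A minimal sketch of how the mean and standard deviation per step might be collected. It assumes that each replicate draws k bins at random without replacement, and that an entry is unique when its row is non-zero in exactly one of the drawn bins. The names `n_boot`, `read_lookup` and `boot_summary`, the seed, and the number of replicates are illustrative choices, not part of the question.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Toy versions of datA and datB from the question
datA <- data.frame(
  source = LETTERS[1:7],
  Bin1 = c(1, 1, 0, 0, 4, 1, 0),
  Bin2 = c(1, 1, 0, 0, 0, 0, 5),
  Bin3 = c(2, 1, 0, 2, 0, 1, 0),
  Bin4 = c(2, 1, 1, 0, 1, 2, 0),
  Bin5 = c(3, 1, 0, 0, 0, 1, 0)
)
datB <- data.frame(
  Bins = paste0("Bin", 1:5),
  readcounts = c(100, 40, 200, 150, 320)
)

bins <- setdiff(names(datA), "source")
m_all <- as.matrix(datA[, bins])                     # numeric matrix of counts
read_lookup <- setNames(datB$readcounts, datB$Bins)  # read counts by bin name
n_boot <- 100                                        # replicates per step (assumed)

boot_summary <- do.call(rbind, lapply(seq_along(bins), function(k) {
  # Each replicate: pick k bins, sum the unique entries and the read counts
  reps <- replicate(n_boot, {
    picked <- sample(bins, k)
    m <- m_all[, picked, drop = FALSE]
    # ASSUMPTION: unique = non-zero in exactly one of the picked bins
    uniq_rows <- rowSums(m > 0) == 1
    c(uniq = sum(m[uniq_rows, , drop = FALSE]),
      reads = sum(read_lookup[picked]))
  })
  data.frame(
    RandomlyPicked_samples = k,
    Mean_uniq_entries = mean(reps["uniq", ]),
    SD_uniq_entries = sd(reps["uniq", ]),
    Mean_readcounts = mean(reps["reads", ]),
    SD_readcounts = sd(reps["reads", ])
  )
}))
boot_summary
```

When k equals the total number of bins, every replicate picks the same set, so the standard deviation at that step is zero.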
I can already get the sums of shared entries and unique entries for each column versus all the others using these commands in R:
df <- data.frame(
"source" = LETTERS[1:7],
"Bin1" = c(1,1,0,0,4,1,0),
"Bin2" = c(1,1,0,0,0,0,5),
"Bin3" = c(2,1,0,2,0,1,0),
"Bin4" = c(2,1,1,0,1,2,0),
"Bin5" = c(3,1,0,0,0,1,0)
)
# Rows with a non-zero entry in every bin count as "shared"; sum them per column
colsum_of_shared_entries <- colSums(df[rowSums(df[, -1] > 0) == ncol(df) - 1, -1])
colsum_of_shared_entries
# Bin1 Bin2 Bin3 Bin4 Bin5
# 2 2 3 3 4
# Rows with a non-zero entry in exactly one bin count as "unique"; sum them per column
sum_of_unique_counts <- colSums(df[rowSums(df[, -1] > 0) == 1, -1])
sum_of_unique_counts
# Bin1 Bin2 Bin3 Bin4 Bin5
# 0 5 2 1 0
Any help would be much appreciated.