1

In R, I have a large data frame (23344 rows x 89 columns) with sampling locations as columns and entries as rows.

A value of 1 means the object was found in this sampling location.
A value of 0 means the object was not found in this sampling location.

To calculate the degrees (connections) per sampling location (node), I want to take, for each row, the row sum minus 1 (this equals the number of degrees) and change the 1s in that row to that value. After that I can use colSums() to calculate the total degrees per sampling location.

A reproducible example of my dataframe:

loc1 <- c(1,0,1)
loc2 <- c(0,1,1)
loc3 <- c(1,1,0)
loc4 <- c(1,1,0)
loc5 <- c(0,1,0)
df <- data.frame(loc1, loc2, loc3, loc4, loc5)

#    loc1 loc2 loc3 loc4 loc5
# 1  1    0    1    1    0
# 2  0    1    1    1    1
# 3  1    1    0    0    0

Desired output looks like this

#    loc1 loc2 loc3 loc4 loc5
# 1  2    0    2    2    0     # rowsum = 3, so change the 1s to 2
# 2  0    3    3    3    3     # rowsum = 4, so change the 1s to 3
# 3  1    1    0    0    0     # rowsum = 2, so the 1s stay 1

I have code that works, but it is slow because it contains a for loop. Is there a better/faster way to do this? I'm aware of the function rowSums(), which may be part of the solution.

My current code is as follows:

for (r in seq_len(nrow(df))) {
    # replace the 1s in row r with (row sum - 1)
    df[r, df[r, ] == 1] <- sum(df[r, ]) - 1
}

degrees_per_sample <- colSums(df)
Jonathan Hall
  • 4
    If your data is all numeric, it can be faster to work with matrices. You can do `df * (rowSums(df) - 1)`, but it will be faster if df is a matrix – user20650 Jul 29 '20 at 12:31

3 Answers

2

I thought it might be interesting to see the benefit of using matrices instead of data.frames for this kind of operation:

set.seed(1)
df = as.data.frame(matrix(rbinom(23344*89,1, 0.5), ncol=89))
m = as.matrix(df) # deliberately did the coercion outside the benchmark

all.equal(as.data.frame(ifelse(df == 1, rowSums(df) - 1, 0)), df* (rowSums(df) - 1))

microbenchmark::microbenchmark(
  a = {ifelse(df == 1, rowSums(df) - 1, 0)},
  b = {df* (rowSums(df) - 1)},
  c = {m* (rowSums(m) - 1)}
)
# Unit: milliseconds
#  expr       min        lq      mean   median        uq      max neval cld
#     a 112.29431 142.70233 165.39007 149.7674 157.63988 304.6195   100  b 
#     b 193.05255 222.24858 245.57206 228.2012 236.38952 402.2677   100   c
#     c  18.49041  26.92273  33.77159  27.3092  27.80769 181.4236   100 a  

Note: the results differ in class (a and c return matrices, b returns a data.frame), which will affect the timings.
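To make the matrix variant concrete, here is a minimal sketch on the toy data from the question. It assumes every entry is 0 or 1; the names `res` and `degrees_per_sample` are only illustrative.

```r
# Toy data from the question, built directly as a numeric matrix
m <- cbind(
  loc1 = c(1, 0, 1),
  loc2 = c(0, 1, 1),
  loc3 = c(1, 1, 0),
  loc4 = c(1, 1, 0),
  loc5 = c(0, 1, 0)
)

# rowSums(m) - 1 holds one value per row; multiplying recycles it down
# each column, so every 1 in row i becomes (row sum of i) - 1
res <- m * (rowSums(m) - 1)

# Total degrees per sampling location
degrees_per_sample <- colSums(res)
degrees_per_sample
```

If the data starts out as a data.frame, a single `as.matrix(df)` before the computation should be enough, provided all columns are numeric.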

user20650
  • 1
    I had not used microbenchmarking before to compare computation times and my default is using dataframes in R rather than matrices, so this suggestion is super helpful and will help me in the future. It is indeed 10x faster than using a dataframe. The checkmark remains with the answer above as it saves me the step of coercing my df. – Caletha Neleh Jul 29 '20 at 13:15
0

You can try using ifelse() on the data frame:

df[] <- ifelse(df == 1, rowSums(df) - 1, 0)

Which gives:

  loc1 loc2 loc3 loc4 loc5
1    2    0    2    2    0
2    0    3    3    3    3
3    1    1    0    0    0
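Why this works: `rowSums(df) - 1` has one value per row, and R recycles it in column-major order, so each cell in row i is paired with the value for row i. The sketch below spells out that recycling explicitly; `deg`, `recycled`, `implicit` and `explicit` are hypothetical names used only for illustration.

```r
df <- data.frame(
  loc1 = c(1, 0, 1),
  loc2 = c(0, 1, 1),
  loc3 = c(1, 1, 0),
  loc4 = c(1, 1, 0),
  loc5 = c(0, 1, 0)
)

# One value per row: row sum minus 1
deg <- rowSums(df) - 1

# What recycling does implicitly: repeat `deg` down every column
recycled <- matrix(deg, nrow = nrow(df), ncol = ncol(df))

# Both calls should give the same matrix
implicit <- ifelse(df == 1, deg, 0)
explicit <- ifelse(df == 1, recycled, 0)
isTRUE(all.equal(implicit, explicit))
```

The trick depends on the vector having exactly `nrow(df)` elements; a per-column vector would not line up the same way.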
Ritchie Sacramento
0

You can use:

df[] <- +(df > 0) * (rowSums(df) - 1)
df

#  loc1 loc2 loc3 loc4 loc5
#1    2    0    2    2    0
#2    0    3    3    3    3
#3    1    1    0    0    0
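Here the unary `+` turns the logical matrix `df > 0` into 0/1 integers before the multiplication. If only the final per-location totals are needed, the intermediate table can be skipped: summing each column of `m * d` is the same as the matrix product of `t(m)` and `d`. This is a sketch under the assumption that the data is 0/1 and fits in a numeric matrix; `m`, `d`, `via_colsums` and `via_crossprod` are illustrative names.

```r
df <- data.frame(
  loc1 = c(1, 0, 1),
  loc2 = c(0, 1, 1),
  loc3 = c(1, 1, 0),
  loc4 = c(1, 1, 0),
  loc5 = c(0, 1, 0)
)

m <- as.matrix(df)
d <- rowSums(m) - 1            # degrees contributed by each row

# Two-step version: rewrite the 1s, then sum the columns
via_colsums <- colSums(m * d)

# One-step version: crossprod(m, d) computes t(m) %*% d
via_crossprod <- drop(crossprod(m, d))

via_crossprod
```

Whether this is faster than `colSums(m * d)` on the full data would need to be benchmarked; it mainly avoids allocating the intermediate matrix.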
Ronak Shah