How to convert frequency distribution to probability distribution in R

Question

I have a matrix with n rows of observations. Observations are frequency distributions of the features. I would like to transform the frequency distributions to probability distributions where the sum of each row is 1. Therefore each element in the matrix should be divided by the sum of the row of the element.

I wrote the following R function that does the work but it is very slow with large matrices:

prob_dist <- function(x) {

    row_prob_dist <- function(row) {
       return (t(lapply(row, function(x,y=sum(row)) x/y)))
       }

    for (i in 1:nrow(x)) {
       if (i==1) p_dist <- row_prob_dist(x[i,])
       else p_dist <- rbind(p_dist, row_prob_dist(x[i,]))
       }
    return(p_dist)
}

B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
B
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7

prob_dist(B)
     [,1]      [,2]    
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3       0.7

Could you suggest R function that does the job and/or tell me how can I optimise my function to perform faster?

A general point: since you made the first row a special case, calculate it outside your loop and do `for( in 2:nrow(x)) ` and delete the `if/else` inside the loop. Next, since you know the dimension of your output matrix in advance, create an empty `p_dist<-matrix(NA,nrow=nrow(x),ncol=ncol(x))` . All those `rbind` s waste time. — Carl Witthoft, Feb 01 '15 at 21:21
@DavidArenburg you might want to mention that `prop.table` is just a shortcut for `sweep` — Carl Witthoft, Feb 01 '15 at 21:22

Steven Beaupré · Accepted Answer · 2015-02-03T16:00:40.840

6

Here's an attempt, but on a dataframe instead of a matrix:

df <- data.frame(replicate(100,sample(1:10, 10e4, rep=TRUE)))

I tried a dplyr approach:

library(dplyr)
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)

Here are the results:

library(microbenchmark) 
mbm = microbenchmark(
dplyr = df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
t = t(t(df) / rep(rowSums(df), each=ncol(df))),
apply = t(apply(df, 1, prop.table)),
times = 100
)

enter image description here

#> mbm
#Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval
# dplyr  123.1894  124.1664  137.7076  127.3376  131.1523  445.8857   100
#     t  384.6002  390.2353  415.6141  394.8121  408.6669  787.2694   100
# apply 1425.0576 1520.7925 1646.0082 1599.1109 1734.3689 2196.5003   100

Edit: @David benchmark is more in line with OP so I suggest you consider his approach if you are to work with matrices.

edited Feb 03 '15 at 16:00

answered Feb 01 '15 at 23:50

Steven Beaupré

21,343
7
57
77

Steven, never met the notation with %>% before and googling didn't reveal any references. Could you point out some references to read? – Andres Kull Feb 02 '15 at 13:02
1

@AndresKull - `%>%` is the pipe operator (from package `magrittr`). You can read about it here: http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html – Steven Beaupré Feb 02 '15 at 13:08
1

MInd posting the code you used to generate that great graph? – Carl Witthoft Feb 02 '15 at 21:19
1

@CarlWitthoft There is an autoplot method for microbenchmark objects in `ggplot2`. To reproduce the above graph, you can simply do `ggplot2::autoplot(mbm)` – Steven Beaupré Feb 02 '15 at 21:36

Colonel Beauvel · Answer 2 · 2015-02-01T21:29:34.153

4

Without apply, a vectorized solution in one line:

t(t(B) / rep(rowSums(B), each=ncol(B)))
          [,1]      [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3000000 0.7000000

Or:

diag(1/rowSums(B)) %*% B

edited Feb 01 '15 at 21:29

answered Feb 01 '15 at 21:21

Colonel Beauvel

30,423
11
47
87

Very nice! I was going to dig up a non-looped,non-apply but yours is better. – Carl Witthoft Feb 01 '15 at 21:23
Excellent! The first one is 3x faster than the the version with apply proposed by @DavidArenburg. The second one is very slow with large matrix. – Andres Kull Feb 01 '15 at 21:55
1

Goliath sometimes win hopefully ;) – Colonel Beauvel Feb 01 '15 at 21:56

David Arenburg · Answer 3 · 2015-02-03T13:05:05.507

Actually I gave it a quick thought and the best vecotization would be simply

B/rowSums(B)
#           [,1]      [,2]
# [1,] 0.6666667 0.3333333
# [2,] 0.4444444 0.5555556
# [3,] 0.3000000 0.7000000

Actually @Stevens benchmark was misleading because OP has a matrix, while Steven benchmark on a data frame.

Here's a benchmark with a matrix. So for matrices, both vectorized solution will be better than dplyr which doesn't work with matrices

set.seed(123)
m <- matrix(sample(1e6), ncol = 100)

library(dplyr)
library(microbenchmark) 

Res <- microbenchmark(
  dplyr = as.data.frame(m) %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
  t = t(t(m) / rep(rowSums(m), each=ncol(m))),
  apply = t(apply(m, 1, prop.table)),
  DA = m/rowSums(m),
  times = 100
)

enter image description here

score 0 · Answer 4 · answered Feb 01 '15 at 21:30

I'm not sure that your function has any value, since you could just use the hist or density functions to accomplish the same result. Also, the use of apply would work as mentioned. But it serves as a reasonable programming example.

There are several inefficiencies in your code.

you use a for loop instead of vectorizing your code. This is very expensive. You should use apply as mentioned in the comments above.
You are using rbind instead of pre-allocating space for your output. This is extremely expensive as well.
```
out <- matrix(NA, nrow= n, ncol= ncol(B))
for (i in 1:nrow(B)) {
  out[i,] <- row_prob_dist(B[i,])
}
```

Alex, how would you use hist or density in this case? – Andres Kull Feb 01 '15 at 22:07 — Andres Kull, Feb 01 '15 at 22:07

How to convert frequency distribution to probability distribution in R

4 Answers4