R snowfall : parallel apply on table columns

Question

I have a table M with many columns and rows, obtained from a text file :

M <- read.table("text.csv",header=TRUE,sep="\t")

To obtain the ranks by columns I successfully used :

M <- apply(M,2,rank)

I would like to speed up the computation, but I have not managed to implement this with snowfall.

I tried :

library(snowfall)
sfStop()
nb.cpus <- 8
sfInit(parallel=TRUE, cpus=nb.cpus, type = "SOCK")
M <- sfClusterApplyLB(M, rank) # does not work
M <- sfClusterApply(M,2,rank) # does not work
M <- sfClusterApplyLB(1:8, rank,M) # does not work

What is the equivalent of M <- apply(M,2,rank) in snowfall ?

Thanks in advance for your help !

The second argument to "sfClusterApply" must be a function. It doesn't take a "margin" argument. — Steve Weston, Feb 15 '16 at 15:37
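As a minimal sketch of what this comment implies (the toy data frame and the names ranks_list and ranks are illustrative, not from the question): sfClusterApply behaves like lapply, so it can walk over the columns of a data frame as list elements, and the list it returns then has to be bound back into a matrix to match the shape of apply(M, 2, rank).

```r
library(snowfall)

# Toy data standing in for the table read from "text.csv"
M <- data.frame(a = c(3, 1, 2), b = c(10, 30, 20))

sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# Second argument is the function; there is no margin argument.
# as.list(M) makes it explicit that we iterate over columns.
ranks_list <- sfClusterApply(as.list(M), rank)

sfStop()

# Bind the per-column results back into a matrix, one column per input column
ranks <- do.call(cbind, ranks_list)
colnames(ranks) <- names(M)
```

The result should match the sequential apply(M, 2, rank) for the same data.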

score 1 · Accepted Answer · answered Feb 14 '16 at 01:43

The equivalent of apply in snowfall is sfApply. Here's an example:

library(snowfall)
sfInit(parallel=TRUE, cpus=4, type="SOCK")
M <- data.frame(matrix(rnorm(40000000), 2000000, 20))
r <- sfApply(M, 2, rank)
sfStop()

This example runs almost twice as fast as the sequential version on my Linux machine using four cores. That's not too bad considering that rank isn't very computationally intensive.
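One way to check the speed-up on your own machine is to time both versions with system.time. This is only a sketch: the matrix size and the number of CPUs are arbitrary choices for illustration, and the timings will vary with hardware and data size.

```r
library(snowfall)

# Smaller than the example above so it runs quickly; adjust to taste
M <- matrix(rnorm(2e6), nrow = 1e5, ncol = 20)

# Sequential baseline
t_seq <- system.time(r_seq <- apply(M, 2, rank))

# Parallel version; cluster start-up is deliberately left out of the timing
sfInit(parallel = TRUE, cpus = 2, type = "SOCK")
t_par <- system.time(r_par <- sfApply(M, 2, rank))
sfStop()

# Elapsed wall-clock seconds for each version
print(c(sequential = t_seq[["elapsed"]], parallel = t_par[["elapsed"]]))

# Both versions should produce the same ranks
stopifnot(isTRUE(all.equal(r_seq, r_par, check.attributes = FALSE)))
```

If the sfInit call is included in the timed region, the parallel version can easily come out slower for a cheap function like rank.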

score 0 · Answer 2 · answered Feb 11 '16 at 00:14

Here is a working example:

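# Rank column i of M. M is read from the global environment,
# so it must exist on every worker (see sfExportAll() below).
rank_M_df_col_fx <- function(i) {
  #M <- read.table("text.csv",header=TRUE,sep="\t")
  col_rank <- rank(M[,i])
  return(col_rank)
}

M <- data.frame(replicate(10,sample(0:100,1000,replace=TRUE)))
n_cols <- ncol(M)

library(snowfall)
sfInit(parallel=TRUE)
sfExportAll() # copy M and rank_M_df_col_fx to the workers
rank_results_list <- sfLapply(seq_len(n_cols), rank_M_df_col_fx)
# Reassemble the list of per-column ranks into a data frame
rank_dataframe <- data.frame(matrix(unlist(rank_results_list), nrow=nrow(M), byrow=FALSE))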

sfRemoveAll()
sfStop()

That said, this is a fast operation, so parallelizing it will likely not give substantially faster results, given the overhead of starting the worker instances, among other costs.

Lucas Fortini


Thank you very much for this very good answer ! I did some tests and, as you pointed out, the parallel code is not faster at least on my example. – Fred Feb 11 '16 at 11:12
No problem! If my response answers your question, you should check it as the correct answer... – Lucas Fortini Feb 12 '16 at 02:01

score 0 · Answer 3 · answered Feb 15 '16 at 10:26

Thank you very much for your help !

In the end I combined the solutions from Lucas and Steve to get the ideal solution for my problem.

I think that my code was not working with M <- sfClusterApply(M,2,rank) because sfExportAll() was missing.

So finally the simplest solution working for me is :

M <- read.table("text.csv",header=TRUE,sep="\t")
n_cols=ncol(M)
nb.cpus <- 4
library(snowfall)
sfStop()
sfInit(parallel=TRUE, cpus=nb.cpus, type = "SOCK") 
sfExportAll()
M <- sfApply(M,2,rank)
sfRemoveAll()
sfStop()

Fred


The "rank" function doesn't need any data from the global environment to work correctly, whereas the "rank_M_df_col_fx" function in Lucas's answer does. Using "sfExportAll" in your answer only wastes time creating global variables on the workers that won't be used. The reason "sfClusterApply" didn't work for you is because it is equivalent to "lapply", not "apply". – Steve Weston Feb 17 '16 at 00:34
Thank you for this comment, Steve. It's true that in the simple example of rank, "sfExportAll" is useless. In another, more complex computation I needed "sfExportAll". – Fred Feb 18 '16 at 15:41
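The distinction drawn in these comments can be sketched as follows. The function rank_col and the toy data are hypothetical stand-ins: a worker function that reads M from the global environment needs M exported (here with a targeted sfExport("M") rather than sfExportAll), whereas passing the columns as the iterated argument needs no export at all.

```r
library(snowfall)

M <- data.frame(replicate(4, sample(0:100, 50, replace = TRUE)))

# Hypothetical worker function that depends on the global variable M
rank_col <- function(i) rank(M[, i])

sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# Case 1: the function reads a global, so export just that variable
sfExport("M")
res_global <- sfLapply(seq_len(ncol(M)), rank_col)

# Case 2: the data travels as the argument, so nothing needs exporting
res_arg <- sfLapply(as.list(M), rank)

sfStop()

# Both approaches should give the same per-column ranks
stopifnot(isTRUE(all.equal(unname(res_global), unname(res_arg))))
```

Exporting only what the worker function actually uses avoids copying unrelated global variables to every worker.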
