I'm running a discrete wavelet transform in R with the code below. The data is fetched in chunks from a Hive table as a data.table, each chunk is converted to a matrix, and the transform is applied to every row of that matrix.

    library(parallel)
    library(wavelets)

    # Create the discrete wavelet transform for each row of a matrix
    createWt <- function(d_matrix) {

      wtScore <- NULL
      for (i in 1:nrow(d_matrix)) {
        a <- d_matrix[i, ]
        wt <- dwt(a, filter = "haar", boundary = "periodic")
        # Keep all detail coefficients plus the smooth coefficients of the last level
        wtScore <- rbind(wtScore, unlist(c(wt@W, wt@V[[wt@level]])))
      }

      return(wtScore)
    }

    # Apply the function to a list of matrices in parallel using mclapply
    wtScore <- parallel::mclapply(m_score, createWt, mc.cores = 28)

The transform takes about 3 hours 30 minutes for a dataset of 10 million rows x 52 columns on a Linux machine with 32 cores, of which I use 28 as shown above. I need to run it on a dataset of 30-35 million rows x 52 columns, and 30 million rows take about 26 hours. m_score above is a list of chunked matrices converted from the data.table.
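For reference, a list shaped like m_score could be built roughly as follows. This is only a sketch: `chunk_matrix`, `chunk_rows` and the small random `demo` matrix are hypothetical stand-ins, not the actual Hive-to-data.table pipeline.

```r
# Hypothetical helper: split a matrix into a list of consecutive row chunks.
# Assumes the matrix has at least one row.
chunk_matrix <- function(m, chunk_rows) {
  stopifnot(nrow(m) > 0, chunk_rows >= 1)
  starts <- seq(1L, nrow(m), by = chunk_rows)
  lapply(starts, function(s) {
    e <- min(s + chunk_rows - 1L, nrow(m))
    m[s:e, , drop = FALSE]  # drop = FALSE keeps a one-row chunk as a matrix
  })
}

set.seed(1)
demo <- matrix(rnorm(10 * 52), nrow = 10, ncol = 52)  # small stand-in for the real data
m_score <- chunk_matrix(demo, chunk_rows = 4)
```

Each element of the list is a matrix with 52 columns, which is the shape that `createWt` expects.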

Any ideas on how to do this faster in R?

  • I'm looking for any specialized libraries in R or other languages.
  • Since the data comes from a Hive table, I'm also open to doing the transform in a Hive UDF, but I couldn't find a UDF for the wavelet transform.
ML_Passion
  • Not sure how much gain you will get out of this, but switch `wtScore <- NULL` to `wtScore <- matrix(0L, nrow(d_matrix), ncol=)` and then fill in the rows of the matrix rather than using `rbind`. Growing objects in a loop is a bad idea, and terrible with long loops, as it involves many copies. – lmo Apr 04 '18 at 23:05
  • Good suggestion; here is what I did. I changed the final object to a data.table, which is even faster and doesn't create copies. ```createWt <- function(mat){ wtScore <- data.table::data.table(NULL); for (i in 1:nrow(mat)){ a <- mat[i,]; wt <- dwt(a, filter= "haar" , boundary = "periodic" ); m1 <- matrix(unlist(c(wt@W,wt@V[[wt@level]])), ncol=50, byrow=TRUE); wtScore <- rbind(wtScore, data.table::data.table(m1, stringsAsFactors=FALSE)) }; return(wtScore) }``` – ML_Passion Apr 06 '18 at 13:58
  • I'm still looking for a better solution. I'm working on an approach that feeds the data through Hive in more chunks, and I'll let the community know if it works out. – ML_Passion Apr 06 '18 at 14:01
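
The preallocation idea from lmo's comment could look roughly like the sketch below. The names `createWtPrealloc`, `row_fn` and `haar_step` are hypothetical. `haar_step` is only a stand-in for the per-row transform so the example runs without the wavelets package: it performs a single Haar-style step on an even-length vector, and its sign and ordering conventions may differ from `wavelets::dwt`.

```r
# Hypothetical stand-in for the per-row transform: one level of pairwise
# differences and sums scaled by 1/sqrt(2). Assumes an even-length input.
haar_step <- function(a) {
  odd  <- a[seq(1L, length(a), by = 2L)]
  even <- a[seq(2L, length(a), by = 2L)]
  c((even - odd) / sqrt(2), (even + odd) / sqrt(2))
}

# Preallocate the result matrix and fill it row by row instead of growing it
# with rbind. The output width is taken from the first transformed row.
createWtPrealloc <- function(d_matrix, row_fn = haar_step) {
  n <- nrow(d_matrix)
  if (n == 0L) return(matrix(numeric(0), nrow = 0L, ncol = 0L))
  first <- row_fn(d_matrix[1L, ])
  wtScore <- matrix(NA_real_, nrow = n, ncol = length(first))
  wtScore[1L, ] <- first
  for (i in seq_len(n)[-1L]) {
    wtScore[i, ] <- row_fn(d_matrix[i, ])
  }
  wtScore
}

# To use the transform from the question instead (requires the wavelets package):
# row_fn = function(a) {
#   wt <- dwt(a, filter = "haar", boundary = "periodic")
#   unlist(c(wt@W, wt@V[[wt@level]]))
# }

demo <- matrix(as.numeric(1:12), nrow = 3, byrow = TRUE)
res <- createWtPrealloc(demo)
```

Running the transform once on the first row avoids hard-coding the number of output columns, at the cost of assuming every row produces the same number of coefficients.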

0 Answers