
I wrote a script that processes two large vectors, "parent" and "product", each with thousands of elements.

The starting dataset is something like this:

parent<-sample(1:10000,3500)
product<-sample(1:7500,2500)
mztol<-0.0015
mzdiff<-sample(1:1000,31)
names(mzdiff)<-c("d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
             "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", 
             "d19", "d20", "d21", "d22", "d23", "d24", "d25", 
             "d26", "d27", "d28", "d29", "d30",
             "d31")

First, I applied the outer function to get a matrix of the element-by-element differences between the two vectors.

tabdiff<-outer(product,parent,'-')

Then I subtracted each value of a vector (mzdiff) from every element of tabdiff, to check which absolute differences are <= a tolerance (mztol). I did this with the outer function as well.

subfun<-function(x,y) abs(x-y)<=mztol
vsubfun<-Vectorize(subfun)
vlogres<-outer(tabdiff,mzdiff,vsubfun)

This gives a 3-D logical array holding one product-by-parent logical matrix per mzdiff value. I then converted it to a list:

library(plyr)  # provides alply
listres<-alply(vlogres,3,.dims=TRUE)

and counted the TRUE elements in each matrix:

result<-sapply(listres, function(x) table(x)["TRUE"])

The script works fine when I process only small parent and product vectors, like:

parent<-sample(1:1000,150)
product<-sample(1:1500,500)

With the large vectors I get the error message "error: memory exhausted (limit reached?)" when vlogres is computed. My workstation has 16 GB of RAM, but it fails anyway.
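A rough size estimate for the large case, as a back-of-the-envelope sketch. It assumes R's usual storage of 4 bytes per logical and 8 bytes per double, and that outer() expands both of its arguments to the full result length before calling the function; the variable names are only for illustration.

```r
# Elements in the 3-D result: product x parent x mzdiff
n_elem <- 2500 * 3500 * 31

# Size of the final logical array, in GiB (4 bytes per logical)
logical_gib <- n_elem * 4 / 2^30

# Size of each expanded double argument that outer() builds, in GiB
double_gib <- n_elem * 8 / 2^30

print(c(elements = n_elem, logical_gib = logical_gib, double_gib = double_gib))
```

The final array alone is about 1 GiB, so it is probably the intermediates that hurt: Vectorize() wraps mapply(), which calls subfun once per element, and the per-call overhead on roughly 271 million elements is likely far larger than the result itself.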

So how can I optimize this script to avoid the error? Any hints?

AeonRed
  • It would probably help if you provided a small reproducible example of what you're trying to do (not one that replicates the error). – CPak Dec 06 '17 at 15:39
  • You're right. Just edited ;) – AeonRed Dec 06 '17 at 18:33
  • Hi. Your input is just two arrays. What do they represent? How are they constrained? What do the names have to do with it? What is the result, given the input? You appear to be expressing the solution in terms of huge intermediate relations that you don't need. PS Re tags: don't you mean outer product, not outer join? – philipxy Dec 16 '17 at 04:52
  • Hi Philip. I'm dealing with chemical data (LC-MS/MS features). Parent is a vector of "parent compounds", molecules which can undergo a fragmentation that yields one or more products. Every parent and product molecule has a feature known as m/z (the values in the parent and product vectors). I would like to figure out which kind of fragmentation takes place, and hence whether there is a link between a parent and a product compound. – AeonRed Dec 26 '17 at 17:15
  • The element-by-element difference of these two vectors gives a matrix of values (tabdiff) that carries information about the fragmentation processes (a fragmentation implies a loss of m/z when a parent turns into a product). Every element of this matrix has to be compared with the given values (mzdiff). If an element matches an mzdiff value, I have an actual fragmentation. So I would like to make the comparison of tabdiff with mzdiff easier and quicker. BTW, I'd like an output where the results are reported for each mzdiff value. – AeonRed Dec 26 '17 at 17:32

1 Answer


The easiest solution I can think of is to iterate over tabdiff as a 1-D vector:

tabdiff<-c(outer(product, parent, '-'))
result <- sapply(tabdiff, function(i) sum(abs(i-mzdiff) <= mztol))

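The sapply call should do what you want, but you should double-check it. Note that it returns, for each product - parent difference, the number of mzdiff values it matches, rather than one count per mzdiff value. It avoids storing an m * n * p array.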
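If you also need to know which parent and product produced each match, a variation is to loop over mzdiff instead, so that only one logical matrix exists at a time. This is a minimal sketch using the sample data from the question; set.seed and the names matches and idx are only there for illustration.

```r
set.seed(1)                                   # only to make the sketch reproducible
parent  <- sample(1:10000, 3500)
product <- sample(1:7500, 2500)
mztol   <- 0.0015
mzdiff  <- sample(1:1000, 31)
names(mzdiff) <- paste0("d", seq_along(mzdiff))

# rows correspond to product, columns to parent
tabdiff <- outer(product, parent, "-")

# One pass per mzdiff value: a single product-by-parent logical matrix at a time
matches <- lapply(mzdiff, function(d) {
  idx <- which(abs(tabdiff - d) <= mztol, arr.ind = TRUE)  # column 1 = row, column 2 = col
  data.frame(product = product[idx[, 1]],
             parent  = parent[idx[, 2]])
})

# Number of matches for each mzdiff value, named d1 ... d31
result <- vapply(matches, nrow, integer(1))
```

Here matches$d1 lists the product/parent pairs whose difference falls within mztol of d1, and result holds the per-mzdiff counts.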


Another idea is to think of your problem backwards. You want product - parent values that are within a tolerance (mztol) of each mzdiff value, which means you want product - parent values in the range mzdiff +/- mztol. You can build vectors of lower and upper bounds, one pair per mzdiff value, and use dplyr::between or the faster data.table::inrange to find which values fall in range.
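The range idea can be sketched in base R as well. Note that this swaps in findInterval for data.table::inrange so that it has no package dependency, and it assumes the windows mzdiff +/- mztol do not overlap (true here, since the mzdiff values are distinct integers and mztol is tiny). The names lower, upper, hit and hits are illustrative only.

```r
set.seed(1)                                   # only to make the sketch reproducible
parent  <- sample(1:1000, 150)
product <- sample(1:1500, 500)
mztol   <- 0.0015
mzdiff  <- sample(1:1000, 31)
names(mzdiff) <- paste0("d", seq_along(mzdiff))

tabdiff <- outer(product, parent, "-")        # rows = product, columns = parent

# Sorted, non-overlapping windows around each mzdiff value (names are kept)
ord   <- order(mzdiff)
lower <- mzdiff[ord] - mztol
upper <- mzdiff[ord] + mztol

x   <- as.vector(tabdiff)
k   <- findInterval(x, lower)                 # last window starting at or below x, 0 if none
hit <- k > 0
hit[hit] <- x[hit] <= upper[k[hit]]           # keep only values that are also below the upper bound

# Recover who is who from the linear positions of the hits
idx  <- arrayInd(which(hit), dim(tabdiff))
hits <- data.frame(mzdiff  = names(lower)[k[hit]],
                   product = product[idx[, 1]],
                   parent  = parent[idx[, 2]],
                   stringsAsFactors = FALSE)

# Matches per mzdiff value, including values with zero matches
result <- table(factor(hits$mzdiff, levels = names(mzdiff)))
```

This handles all mzdiff values in a single vectorized pass and never builds the 3-D array; if the windows could overlap, each difference would be credited only to the highest-starting window containing it, so a per-mzdiff loop would be safer in that case.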

CPak
  • Well, to be honest I need something more. Yes, a 1-D vector makes everything easier, but at the same time I lose all the information about who's who. I need to track each element easily in order to see which parent and product give a match. So the n-dimensional format is useful to me because it's a neat and clean way to store my data (for each mzdiff I have a parent vs. product logical matrix). Btw, I will test the other approach you suggested. Thanks. – AeonRed Dec 07 '17 at 16:44