Automate several calculations in R through data frames

Question

I have a series of vectors, each named after a stock, like FB for Facebook Inc. I have over 70 such vectors inside a data frame, for example GEEK, IPAS, JCON, etc. For each pair of stocks, say GEEK and JCON, I have to calculate a measure called mutual information. I have written some code that computes this measure for one pair of stocks, and it looks like this.

To find entropyz (the entropy of X, Y, say the bivariate entropy of GEEK and JCON returns)

denz<-kde2d(x,y, n=512, lims=c(xlim,ylim))
z<-denz$z
cell_sizez<-(diff(xlim)/512) * (diff(ylim)/512)
normz<-sum(z)*cell_sizez
integrandz<-z*log(z)
entropyz<-sum(integrandz)*cell_sizez
entropyz<-entropyz/normz

To find entropyx (the entropy of X, say GEEK returns)

denx<-kde(x=x,gridsize = 512, xmin=xlim[1], xmax = xlim[2])
zx<-denx$estimate
cell_sizex<-(diff(xlim)/512) 
normx<-sum(zx)*cell_sizex
integrandx<-zx*log(zx)
entropyx<-sum(integrandx)*cell_sizex
entropyx<-entropyx/normx

To find entropyy (entropy of Y, say JCON returns)

deny<-kde(x=y,gridsize = 512, xmin=ylim[1], xmax = ylim[2])
zy<-deny$estimate
cell_sizey<-(diff(ylim)/512) 
normy<-sum(zy)*cell_sizey
integrandy<-zy*log(zy)
entropyy<-sum(integrandy)*cell_sizey
entropyy<-entropyy/normy

Finally, to find the mutual information of GEEK and JCON

MI <- entropyx+entropyy-entropyz
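
A note on sign, offered as a sanity check rather than a correction of the method: differential entropy is usually defined as H = -∫ p log p, so a term of the form sum(p * log(p)) * cell_size estimates -H, and combining three such terms as above would give the negative of the usual mutual information. The sketch below checks the convention on a standard normal sample, where the entropy is known in closed form. It uses base R's density() rather than kde2d/kde, and the sample and all variable names are assumptions made for illustration.

```r
# Sketch: check the sign convention on a case with a known answer.
set.seed(1)
x <- rnorm(5000)                  # illustrative data: standard normal sample

den <- density(x, n = 512)        # base-R 1D kernel density estimate
p   <- den$y                      # density values on the grid
dx  <- diff(den$x[1:2])           # grid spacing

keep  <- p > 0                    # skip zeros, since 0 * log(0) is NaN in R
plogp <- sum(p[keep] * log(p[keep])) * dx / (sum(p) * dx)

h_est   <- -plogp                       # entropy is minus the p*log(p) sum
h_exact <- 0.5 * log(2 * pi * exp(1))   # closed form for N(0, 1)

c(plogp = plogp, h_est = h_est, h_exact = h_exact)
```

If the same convention is applied to all three terms, the mutual information comes out as entropyx + entropyy - entropyz with each entropy carrying the leading minus sign.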

So I have found the mutual information for X and Y (the two stocks above). But I have to calculate this measure for over 70 stocks (vectors), which means 70 * 69 / 2 = 2415 iterations. It is like building a correlation matrix, because it is a pairwise comparison. Does anyone know a way to make R compute the mutual information for all pairs (x, y) in my dataset? In other words, I want to iterate this code over every pair of columns in the data frame, producing a pairwise matrix.

Thanks a lot!

As a starting point, one way to get every pairwise combination of a vector is `combn`. — lmo, May 25 '16 at 19:00
I have updated my answer to make it better, though I am intrigued as to how `xlim` and `ylim` are calculated, and whether they differ for each pair or not. — jamieRowen, May 25 '16 at 20:43
@jamieRowen These limits are just the range of x and y, that is the minimum and the maximum value of those time series. Thanks for your answer. — Alex Quintino Barbi, May 26 '16 at 23:26
@jamieRowen, I used the code below, and it didn't work. I have the rows named like this: a, b, c, d, e, (...) z (...) aa, ab (...) ak. So I'm using this function to calculate the mutual information for x and y. When calling the last part of your code (to apply the combinations), it returned 'Error in kde2d(x, y, n = 8, lims = c(xlim, ylim)) : data vectors must be the same length', but all my vectors have the same length. Do you know what's going on? Thanks! — Alex Quintino Barbi, May 27 '16 at 04:11
@jamieRowen, did you get how 'xlim' and 'ylim' are both calculated? Thanks. — Alex Quintino Barbi, Sep 14 '16 at 20:41
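
The `combn` suggestion from the first comment can be sketched as follows. This is a minimal illustration, not the asker's actual computation: `pair_stat` is a hypothetical stand-in for the real mutual information function, and the data frame is random toy data. Only the choose(n, 2) unordered pairs are evaluated, and each result is then mirrored into both triangles of the matrix.

```r
# Hypothetical stand-in for the real pairwise measure (assumption)
pair_stat <- function(x, y) abs(cor(x, y))

set.seed(42)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50), d = rnorm(50))

# 2 x choose(n, 2) matrix; each column holds one pair of column indices
pairs <- combn(ncol(df), 2)

# Evaluate the measure once per unordered pair
vals <- apply(pairs, 2, function(p) pair_stat(df[[p[1]]], df[[p[2]]]))

# Fill a square matrix; the diagonal is left as NA
res <- matrix(NA_real_, ncol(df), ncol(df),
              dimnames = list(names(df), names(df)))
res[t(pairs)]         <- vals   # upper triangle (row < column)
res[t(pairs)[, 2:1]]  <- vals   # mirrored lower triangle
res
```

For 70 columns, `combn(70, 2)` has 2415 columns, matching the count in the question.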

jamieRowen · Accepted Answer · 2016-05-25T20:41:01.320

If you create a function MI that takes your two vectors of data and returns the value, you could use something like the following to generate a symmetric square matrix of the results. Assuming your data is in a data frame df, we could do:

library(MASS)  # provides kde2d
library(ks)    # provides kde

# Mutual information of x and y. The limits default to the range of each
# vector, which is how the asker describes them in the comments.
MI = function(x,y,xlim=range(x),ylim=range(y)){
  # bivariate term from the 2D kernel density estimate
  denz<-kde2d(x,y, n=512, lims=c(xlim,ylim))
  z<-denz$z
  cell_sizez<-(diff(xlim)/512) * (diff(ylim)/512)
  normz<-sum(z)*cell_sizez
  integrandz<-z*log(z)
  entropyz<-sum(integrandz)*cell_sizez
  entropyz<-entropyz/normz

  # univariate term for x
  denx<-kde(x=x,gridsize = 512, xmin=xlim[1], xmax = xlim[2])
  zx<-denx$estimate
  cell_sizex<-(diff(xlim)/512) 
  normx<-sum(zx)*cell_sizex
  integrandx<-zx*log(zx)
  entropyx<-sum(integrandx)*cell_sizex
  entropyx<-entropyx/normx

  # univariate term for y
  deny<-kde(x=y,gridsize = 512, xmin=ylim[1], xmax = ylim[2])
  zy<-deny$estimate
  cell_sizey<-(diff(ylim)/512) 
  normy<-sum(zy)*cell_sizey
  integrandy<-zy*log(zy)
  entropyy<-sum(integrandy)*cell_sizey
  entropyy<-entropyy/normy

  return(entropyx+entropyy-entropyz)
}
df = data.frame(1:10,1:10,1:10,1:10,1:10)
matrix(
  apply(
    expand.grid(
      seq_along(df),seq_along(df)),1,
    # apply() passes each row as a single vector holding both column indices
    FUN = function(idx) MI(df[[idx[1]]],df[[idx[2]]])
    ),
  nrow = ncol(df)
)

This works because expand.grid gives you all the combinations of column indices in an n^2 by 2 data frame. We then apply the MI function to each row of indices and store the results in a matrix.
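
To make the shape of that grid concrete, here is a small sketch on toy data. The data frame and the summing function are placeholders chosen for illustration, not part of the original answer. It shows that expand.grid yields n^2 rows of index pairs, and that apply over rows hands the function one length-2 vector per call.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)   # toy data (assumption)

# All ordered pairs of column indices: n^2 rows, 2 columns
grid <- expand.grid(seq_along(df), seq_along(df))
dim(grid)

# With MARGIN = 1, the function receives each row as a single vector
arg_lengths <- apply(grid, 1, function(idx) length(idx))

# Hypothetical stand-in measure: sum of the two column totals
out <- matrix(
  apply(grid, 1, function(idx) sum(df[[idx[1]]]) + sum(df[[idx[2]]])),
  nrow = ncol(df)
)
out
```

Because expand.grid varies its first argument fastest and matrix() fills column by column, `out[i, j]` corresponds to the pair (column i, column j).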

Edit: edited for clarity.

@AlexQuintinoBarbi What I mean is you could wrap all of your calculation of MI in a function that takes x and y, since those are the things you pass to the various other functions. On rereading your question I see I misread something the first time: you say you have 70 data frames rather than 70 columns of a data frame. But `kde2d` takes vectors of values, so I made the assumption that each stock is a single vector of data. Are your data frames each a single column? Can you stick them together in one data frame? I have edited the answer to make clearer what I meant. — jamieRowen, May 25 '16 at 20:36
I edited the question to make it clearer. As I commented before, I'm getting two types of error. The first is 'Error in kde2d(x, y, n = 8, lims = c(xlim, ylim)) : data vectors must be the same length', and the second is 'Error in is.finite(x) : default method not implemented for type 'list''. I think the first error might be caused by the different ranges of x and y. As for the second one, I don't know, because I don't have lists. Thanks! — Alex Quintino Barbi, May 27 '16 at 16:04
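
One plausible cause of both messages, stated as an assumption since the exact call that failed is not shown: a data frame is itself a list, so passing a data frame (even a one-column one) where a plain vector is expected can trigger the 'list' error, and its length() is the number of columns rather than the number of rows, which can trip a length check. A small sketch with toy data:

```r
df <- data.frame(a = c(0.1, 0.2, 0.3), b = c(0.4, 0.5, 0.6))  # toy data

one_col_df  <- df["a"]      # single-bracket indexing keeps a data frame
one_col_vec <- df[["a"]]    # double-bracket indexing extracts the vector

is.list(one_col_df)         # a data frame is a list of columns
length(one_col_df)          # counts columns
length(one_col_vec)         # counts observations

# Capture the message produced when is.finite() is given a data frame
err <- tryCatch(is.finite(one_col_df),
                error = function(e) conditionMessage(e))
err
```

Extracting each column with `df[[i]]` before passing it to the density functions avoids both situations.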
