
I currently have a 4024 by 10 array, where column 0 holds the 4024 returns of stock 1, column 1 the 4024 returns of stock 2, and so on. For an assignment for my master's I'm asked to compute the entropy and joint entropy of the different random variables (each random variable being one stock's returns). Both entropy calculations require the empirical probabilities P(x) and P(x,y). So far I've managed to compute the individual empirical probabilities using the following code:

import numpy as np
import pandas as pd

def entropy(ret, t, T, a, n):
    # Load the returns and slice out rows t..T of asset column a
    returns = pd.read_excel(ret)
    returns_df = returns.iloc[t:T, :]
    returns_mat = returns_df.to_numpy()  # .as_matrix() is deprecated
    asset_returns = returns_mat[:, a]

    # Empirical pmf: bin the returns and normalise the counts
    hist, bins = np.histogram(asset_returns, bins=n)
    empirical_prob = hist / hist.sum()

    # Shannon entropy, treating 0*log2(0) as 0
    entropy_vector = np.empty(len(empirical_prob))
    for i in range(len(empirical_prob)):
        if empirical_prob[i] == 0:
            entropy_vector[i] = 0
        else:
            entropy_vector[i] = -empirical_prob[i]*np.log2(empirical_prob[i])

    shannon_entropy = np.sum(entropy_vector)

    return shannon_entropy, empirical_prob

P.S. ignore the whole entropy part of the code
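For reference, I call it along these lines (the file name and parameter values below are placeholders, not my actual data):

# Hypothetical call: entropy of stock 1 over all 4024 rows, using 50 bins
H, p = entropy('returns.xlsx', 0, 4024, 0, 50)
print(H)        # Shannon entropy in bits
print(p.sum())  # the empirical probabilities sum to 1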

As you can see, I've simply built a 1-d histogram and then divided each bin count by the total count in order to obtain the individual probabilities. However, I'm really struggling with how to compute P(x,y) using

np.histogram2d()

Now, obviously P(x,y) = P(x)*P(y) would hold if the random variables were independent, but in my case they are not: these stocks belong to the same index and therefore possess some positive correlation, i.e. they're dependent, so taking the product of the two individual probabilities is not valid. I've tried following the suggestions of my professor, who said:

"We had discussed how to get the empirical pdf for a univariate distribution: one defines the bins and then counts simply how many observations are in the respective bin (relative to the total number of observations). For bivariate distributions you can do the same, but now you make 2-dimensional binning (check for example the histogram2 command in matlab)"

As you can see, he's referring to MATLAB's 2-d histogram function, but I've decided to do this assignment in Python, and so far I've written the following code:

def jointentropy(ret, t, T, a, b, n):
    # Same loading and slicing as in entropy(), but for two asset columns
    returns = pd.read_excel(ret)
    returns_df = returns.iloc[t:T, :]
    returns_mat = returns_df.to_numpy()  # .as_matrix() is deprecated
    assetA = returns_mat[:, a]
    assetB = returns_mat[:, b]

    # 2-d binning: hist[i, j] counts observations in x-bin i and y-bin j
    hist, bins1, bins2 = np.histogram2d(assetA, assetB, bins=n)

But I don't know what to do from here, because

np.histogram2d()

returns a 4025 by 4025 array of counts, as well as the two sets of bin edges, so I don't know how to turn this into P(x,y) for my two dependent random variables.

I've tried to figure this out for hours without any luck, so any kind of help would be highly appreciated. Thank you very much in advance!

Jayjay95

1 Answer


Looks like you've got a clear case of conditional (Bayesian) probability on your hands. You can look it up, for example, here: http://www.mathgoodies.com/lessons/vol6/dependent_events.html, which gives the probability of both events occurring as P(x,y) = P(y) · P(x|y), where P(x|y) is the probability of event x given y. This applies in your situation because the two stocks belong to the same index, so their returns are dependent. Just build two separate sets of bins like you did for one variable and calculate the probabilities as above.
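To make this concrete, here is a minimal sketch (assuming assetA, assetB and n from your question) showing how the marginal and the conditional fall out of the 2-d counts, and that the chain rule recovers the joint probability bin by bin:

import numpy as np

hist, xedges, yedges = np.histogram2d(assetA, assetB, bins=n)
joint = hist / hist.sum()  # empirical P(x,y)
p_y = joint.sum(axis=0)    # marginal P(y): sum each column over the x-bins
# Conditional P(x|y), column by column; empty y-bins are left at zero
with np.errstate(divide='ignore', invalid='ignore'):
    p_x_given_y = np.where(p_y > 0, joint / p_y, 0)
# Chain rule: P(x,y) = P(y) * P(x|y), recovered bin by bin
assert np.allclose(joint, p_x_given_y * p_y)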

postoronnim
  • Don't worry, I think I managed to get it working by doing hist, binsx, binsy = np.histogram2d(assetA, assetB, [n,n]) and then joint_probs = hist/hist.sum(), and I now obtain a 4024 by 4024 joint probability table, which looks more than fine (see the sketch after these comments). Thanks for your response anyway! – Jayjay95 Mar 14 '17 at 17:09
  • Awesome - one caveat, though: the joint probability table has O(N^2) space complexity, whereas two separate sets of bins are linear. – postoronnim Mar 14 '17 at 17:20
  • So do you suggest I should change it to just bins=n? – Jayjay95 Mar 14 '17 at 18:56
  • Just try it out: if it works, it would be the faster of the two. – postoronnim Mar 14 '17 at 18:59
  • I get the same results for the joint entropy with just bins=n as with bins=[n,n], so it seems to work perfectly fine. Thanks a lot! – Jayjay95 Mar 14 '17 at 18:59
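
For completeness, here is the recipe from the comments as a self-contained sketch (the function name joint_entropy and the base-2 log are assumptions of this sketch; assetA and assetB are the two return vectors from the question):

import numpy as np

def joint_entropy(assetA, assetB, n):
    # 2-d empirical pmf: bin both return series jointly, then normalise
    hist, binsx, binsy = np.histogram2d(assetA, assetB, bins=[n, n])
    joint_probs = hist / hist.sum()
    # Joint Shannon entropy; zero cells are skipped so 0*log2(0) counts as 0
    p = joint_probs[joint_probs > 0]
    return -np.sum(p * np.log2(p))

As noted in the comments, bins=n and bins=[n, n] give the same binning here, so either form works.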