I tried to recreate the correlation matrix that is described in scanpy's tutorial, using my own RNAseq dataset.
The relevant function in scanpy is: sc.pl.correlation_matrix
and the plot looks like this:
The main question here is: how was this Pearson's correlation between different cell types calculated, while the size of the matrix for each cell type is different?
For example: I have 1000 genes as columns, 500 CD34+ cells as rows and only 200 CD19+ B cells. So how is it possible to calculate a correlation between both cell types?
Additionally, calculating a correlation between two matrices, results in another matrix (and not with a single scalar such as displayed in the plot above). For example, numpy's corrcoef()
function that is applied on two matrices of the same size results in another matrix and not with a scalar...
I tried to average the genes expression over the cells for each cell type so the calculation contained equally-sized vectors, and it still didn't match scanpy's results.
I encountered this conversation: https://github.com/theislab/scanpy/pull/425 mentioning that when hierarchical clustering is computed, this correlation matrix above is created (but no code was provided).
I'll be happy with any suggestions, explanations and some possible python / code implementations.