
In this Distill article (https://distill.pub/2017/feature-visualization/) in footnote 8 authors write:

The Fourier transforms decorrelates spatially, but a correlation will still exist between colors. To address this, we explicitly measure the correlation between colors in the training set and use a Cholesky decomposition to decorrelate them.

I have trouble understanding how to do that. I understand that for an arbitrary image I can calculate a correlation matrix by interpreting the image's shape as [channels, width*height] instead of [channels, height, width]. But how do I take the whole dataset into account? The per-image matrices could be averaged, but that doesn't have anything to do with a Cholesky decomposition.
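Concretely, the averaging I have in mind looks roughly like this (my own sketch, not anything from Lucid; `image_dataset` stands in for an iterable of [3, H, W] tensors):

```python
# A sketch (mine, not Lucid's code) of averaging per-image 3x3 RGB covariances
# over a dataset; `image_dataset` is a hypothetical iterable of [3, H, W] tensors.
import torch

def dataset_rgb_covariance(image_dataset):
    cov_sum, n_images = torch.zeros(3, 3, dtype=torch.float64), 0
    for img in image_dataset:
        flat = img.reshape(3, -1).to(torch.float64)    # [channels, height*width]
        flat = flat - flat.mean(dim=1, keepdim=True)   # subtract per-channel mean
        cov_sum += flat @ flat.T / flat.shape[1]       # this image's 3x3 covariance
        n_images += 1
    return cov_sum / n_images                          # average over the dataset
```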

Inspecting the code confuses me even more (https://github.com/tensorflow/lucid/blob/master/lucid/optvis/param/color.py#L24). There's no code for calculating correlations; instead there's a hard-coded matrix, and the decorrelation happens by matrix multiplication with it. The matrix is named color_correlation_svd_sqrt, which has "svd" in its name even though SVD isn't mentioned anywhere else. Also, the matrix is non-triangular, which means it can't have come from a Cholesky decomposition.
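As far as I can tell, the relevant part of that file boils down to something like the following paraphrase (numpy instead of TensorFlow, and a random stand-in matrix M in place of the hard-coded values; this is my reading of the code, not the code itself):

```python
# My paraphrase of the decorrelation step in lucid/optvis/param/color.py,
# with a random stand-in M instead of the hard-coded 3x3 matrix.
import numpy as np

M = np.random.rand(3, 3).astype(np.float32)      # stand-in for color_correlation_svd_sqrt
max_norm = np.max(np.linalg.norm(M, axis=0))     # largest column norm (max_norm_svd_sqrt)
M_normalized = M / max_norm

def linear_decorrelate_color(t_flat):
    # t_flat: [n_pixels, 3] values in the "decorrelated" parameterization
    return t_flat @ M_normalized.T               # multiply from the right by M.T
```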

Clarifications on any points I've mentioned would be greatly appreciated.

  • To my dilettantish view, this looks like some sort of whitening of the data. Cholesky whitening is one way to go; others are PCA whitening, ZCA whitening, etc. In this case SVD seems to be used to compute the matrix that whitens the data, which they just saved and apply to new inputs. After that my memories of the math start to fade – Slowpoke Jun 09 '20 at 22:05
  • @Slowpoke, ah, so then the Cholesky decomposition (or SVD) would be calculated from the whole dataset in the first place? (So no averaging would be needed in this case.) – Alexander Chebykin Jun 10 '20 at 18:41
  • Here's what I can suppose about the meaning of this code. If we look at the [types of whitening](https://en.wikipedia.org/wiki/Whitening_transformation), the most probable variant here is ZCA whitening. There the whitening matrix W is the covariance matrix raised to the power -1/2 (they wrote "correlation" and "sqrt"; SVD most probably refers to how that power was computed, another way being an eigendecomposition). To whiten the data, they multiply from the right side by W.T, which is also fine. Why they normalize by dividing by `max_norm_svd_sqrt` is a mystery. The same goes for the means: they are present, but not subtracted – Slowpoke Jun 10 '20 at 19:08
  • The whitening matrix is calculated from the whole dataset, yes. Whitening is also performed on datasets with Nrows >> Ncols; the whitening matrix is Ncols x Ncols – Slowpoke Jun 10 '20 at 19:09
  • You can compute the covariance matrix from the covariance matrices of single images or batches of images from the dataset. Just use the same number of rows in each batch and average those matrices afterwards (the sample covariance is an estimator of the dataset's covariance); you may need float64 precision for that. The mean is also usually subtracted from the data for this, as far as I remember. – Slowpoke Jun 10 '20 at 19:21
  • Thanks for the pointers; now I think what happens is this: in the code they write that they want to go FROM the decorrelated space to the normal one. A way to do that is to multiply by sqrt(cov) [note that they call the variable "correlation ... svd_sqrt", which is cov^0.5, not cov^(-0.5)]. Now, this sqrt(cov) can be produced either by a Cholesky decomposition (as they write in the article), by SVD (which they mention in the code), or by taking the matrix square root of the covariance (see the sketch after these comments). – Alexander Chebykin Jun 12 '20 at 17:51
  • From this it follows that if I multiply the hard-coded matrix with itself, I should get the covariance. And I do get the same covariance as the one I calculate empirically, up to multiplication by a constant. But this is not the same constant as the one by which they normalize, so I still have no idea why they normalize the matrix in such a strange way. Maybe it's connected to the fact that what they store is "covariance^0.5", not "correlation^0.5", even though they name it "correlation". But I don't think that dividing the covariance by a max l2-norm would give the correlation – Alexander Chebykin Jun 12 '20 at 17:53
  • Division by the maximal norm of one of the columns seems to be some of the authors' alchemy, performed for their own needs. A possible reason is to normalize the input data to -1..1 before it enters the neural network. The effect on PCA is just a scalar multiplier. They do that for the input image, so they need to do it for the reverse transform as well. Covariance vs. correlation is a very frequent naming mistake (they also call the reverse transform "decorrelate"). Those are just my thoughts; I didn't have much time to dig through this code. The good thing is that adding the means is present; I found it below in the `to_valid_rgb()` function. – Slowpoke Jun 15 '20 at 10:40
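To make the square-root relationship discussed in the comments above concrete, here is a small standalone sketch (made-up covariance values, not the authors' code): a Cholesky factor, an SVD-based factor, and a symmetric (ZCA-style) matrix square root all satisfy F @ F.T ≈ cov, so any of them can serve as the "sqrt(cov)" that maps decorrelated colors back to correlated ones.

```python
# Sketch: three matrix "square roots" of a 3x3 RGB covariance matrix `cov`;
# each factor F satisfies F @ F.T ≈ cov (up to numerical error).
import torch

cov = torch.tensor([[0.30, 0.28, 0.26],
                    [0.28, 0.30, 0.28],
                    [0.26, 0.28, 0.32]], dtype=torch.float64)  # made-up example values

L = torch.linalg.cholesky(cov)               # lower-triangular factor (the article's wording)
U, S, V = torch.svd(cov)
F_svd = U @ torch.diag(torch.sqrt(S))        # non-triangular factor (what Lucid seems to store)
F_zca = F_svd @ U.T                          # symmetric matrix square root (ZCA flavor)

for F in (L, F_svd, F_zca):
    assert torch.allclose(F @ F.T, cov, atol=1e-8)
```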

1 Answer


I figured out the answer to your question here: How to calculate the 3x3 covariance matrix for RGB values across an image dataset?

In short, you calculate the RGB covariance matrix for the image dataset and then do the following calculation:

import torch

# dataset_rgb_cov_matrix: the 3x3 RGB covariance matrix computed over the dataset
U, S, V = torch.svd(dataset_rgb_cov_matrix)
epsilon = 1e-10  # guards against taking the sqrt of tiny negative eigenvalues
svd_sqrt = U @ torch.diag(torch.sqrt(S + epsilon))
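If it helps, here is a sketch of how I'd then apply it (my own variable names; `decorrelated_flat` is a hypothetical [n_pixels, 3] tensor in the decorrelated space, and the division by the max column norm mirrors what Lucid appears to do):

```python
# Sketch (my naming, not Lucid's code): map decorrelated colors back to RGB.
max_norm = torch.max(torch.norm(svd_sqrt, dim=0))  # largest column norm, Lucid-style scaling
color_matrix = svd_sqrt / max_norm
rgb_flat = decorrelated_flat @ color_matrix.T      # decorrelated_flat: [n_pixels, 3]
```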
ProGamerGov
    Yep, that's almost exactly what I ended up with (I had an additional "@ U.T" at the end of the last line, which makes it a ZCA transform, but it's probably irrelevant). Thanks for writing it up as an answer – Alexander Chebykin Feb 17 '21 at 10:36