
I wrote the following code to create a random matrix Sigma with specified eigenvalues, and then sample vectors from a multivariate normal distribution with zero mean and covariance Sigma.

import numpy

def generate_data(N, d):
    # eigenvalues of Sigma: 1, 2, ..., d+1
    eigenvalues = numpy.arange(1, d + 2)

    random_matrix = numpy.random.randn(d + 1, d + 1)
    random_orthogonal = numpy.linalg.qr(random_matrix)[0]
    sqrt_cov = random_orthogonal @ numpy.diag(numpy.sqrt(eigenvalues))

    X = numpy.zeros((N, d + 1))
    for i in range(N):
        vec = numpy.random.standard_normal(d + 1)
        X[i] = sqrt_cov @ vec
    return X

After this code, X should be an N by (d+1) matrix sampled from the desired distribution.

Now I want to know the eigenvalues of the sample covariance matrix of X. If I am not mistaken, they should be close to the eigenvalues of Sigma.

def get_sample_covariance(data):
    data = data - data.mean(axis=0)
    sample_cov = data.T @ data / (data.shape[0] - 1)
    return sample_cov
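As a quick sanity check of this pipeline, here is a minimal self-contained sketch that mirrors the code above (small d, large N so the estimate has converged; the rng and seed are my own, not part of the original code):

```python
import numpy as np

# Small d, large N: the sample-covariance eigenvalues should
# recover the true spectrum 1, 2, ..., d+1 closely.
rng = np.random.default_rng(0)
d, N = 5, 100_000
eigenvalues = np.arange(1, d + 2)          # true spectrum

q, _ = np.linalg.qr(rng.standard_normal((d + 1, d + 1)))
sqrt_cov = q @ np.diag(np.sqrt(eigenvalues))
X = rng.standard_normal((N, d + 1)) @ sqrt_cov.T   # rows ~ N(0, Sigma)

Xc = X - X.mean(axis=0)
est = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (N - 1)))
print(est)  # close to [1, 2, 3, 4, 5, 6]
```

With N this much larger than d, the estimated spectrum matches the true one to within a few percent.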

I then plotted the eigenvalues of sample_cov.

I expected a roughly linear spectrum, going from d+1 (d was 500) down to 1. The plot I got looked nothing like that.

What gives? Where's the mistake?

Oria Gruber

1 Answer


The generation of samples is correct. However, the covariance estimate from N random samples is only accurate when N >> d. In particular, the highest eigenvalue tends to be systematically overestimated. The mean over all eigenvalues, on the other hand, is quite accurate.

import numpy as np
import matplotlib.pyplot as plt

def get_max_mean_eig_run(N, d):
    """Return highest and mean eigenvalue for random samples."""

    X = generate_data(N, d)
    cov = get_sample_covariance(X)
    eig = np.linalg.eigvalsh(cov)  # eigvalsh: cov is symmetric
    return eig.max(), eig.mean()

np.random.seed(1)
ds = np.arange(1, 501, 25)

# shape (len(ds), 2): max and mean eigenvalue for each d
eigs = np.array([get_max_mean_eig_run(1000, d) for d in ds])

plt.close('all')
plt.xlabel('d')
plt.ylabel('Eigenvalue')
plt.plot(ds, eigs[:, 0], 'r-', label='Max')
plt.plot(ds, eigs[:, 1], 'r--', label='Mean')
plt.scatter(ds, ds+1, color='k', marker='o', label='max expected')
plt.scatter(ds, (ds+2)/2, color='k', marker='*', label='mean expected')
plt.legend()

Inferred eigenvalues

For this particular input spectrum of eigenvalues, you need N > 100*d to get the maximum sample eigenvalue within about 5% of the maximum input eigenvalue, but this will likely be different for more realistic cases.
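This N-dependence can be seen directly with a smaller, self-contained sketch (d = 20 instead of 500 for speed, with an rng and seed of my own choosing; not the answer's original run): the largest sample eigenvalue overshoots the true maximum d + 1 at small N and approaches it as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
true_max = d + 1
eigenvalues = np.arange(1, d + 2)
q, _ = np.linalg.qr(rng.standard_normal((d + 1, d + 1)))
sqrt_cov = q @ np.diag(np.sqrt(eigenvalues))

def max_sample_eig(N):
    # Largest eigenvalue of the sample covariance from N samples.
    X = rng.standard_normal((N, d + 1)) @ sqrt_cov.T
    Xc = X - X.mean(axis=0)
    return np.linalg.eigvalsh(Xc.T @ Xc / (N - 1))[-1]

for N in (50, 500, 5000):
    print(N, max_sample_eig(N))  # overshoot above true_max shrinks as N grows
```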

Here is a histogram of the eigenvalues (for N=1000, d=500)

plt.close('all')
X = generate_data(1000, 500)
eigs = np.linalg.eigvalsh(get_sample_covariance(X))  # eigvalsh: symmetric matrix
hist, bin_edges = np.histogram(eigs, bins=20)
bin_centers = (bin_edges[1:] + bin_edges[:-1])/2
plt.step(bin_centers, hist, where='mid')
plt.xlabel('Eigenvalue')
plt.ylabel('Count (per bin)')
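The spread visible in that histogram can also be quantified without plotting. The following self-contained sketch (smaller d and N than the run above, for speed; the setup mirrors generate_data but is my own) counts how many sample eigenvalues fall outside the true range [1, d+1]:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 100, 200                      # N comparable to d, as in the histogram
true_eigs = np.arange(1, d + 2)
q, _ = np.linalg.qr(rng.standard_normal((d + 1, d + 1)))
X = rng.standard_normal((N, d + 1)) @ (q @ np.diag(np.sqrt(true_eigs))).T

Xc = X - X.mean(axis=0)
eig = np.linalg.eigvalsh(Xc.T @ Xc / (N - 1))
outside = int(np.sum((eig < 1) | (eig > d + 1)))
print(f"{outside} of {d + 1} sample eigenvalues lie outside [1, {d + 1}]")
```

A large fraction ends up outside the true range, which is exactly the broadening the histogram shows.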

Eigenvalue histogram

Han-Kwang Nienhuys