
Problem: I want to implement an algorithm from an unpublished paper by my supervisor and, as part of that, I need to construct a covariance matrix C using some rules given in the paper. I'm coming from Matlab and wanted to take this opportunity to finally learn Python, hence my question: How do I do this in the most efficient (fast) way in Python (including numpy, scipy)?

Subproblem 1:

  • Option 1: I use 2 for loops, looping over all rows and all columns. I would assume that's the worst thing to do.
  • Option 2: Using list comprehension, I construct a list of euclidean pairs and then iterate over that list. That's what I'm doing now.

Is there any better way?

Subproblem 2

  • Option 1: I iterate over all elements in the matrix.
  • Option 2: I iterate only over the lower triangular part (without diagonal), then add the transpose (because covariance matrices are symmetric) and then add the diagonal.

I'm fairly convinced subproblem 1 is a no-brainer but I don't know about subproblem 2. I should probably also say that the matrix I'm dealing with is probably 2*10^4 x 2*10^4.

Thanks!

Edit: I prefer not to give the actual covariance matrix, but since people want to have an example, let's say we want to construct the covariance matrix of a stochastic process called the 'Brownian bridge'. Its structure is given by:

cov(X_s, X_t) = min{s, t} − s·t

for let's say s,t ∈ {1,...,100}. How would you build it?

  • If you're learning python, jumping into this problem isn't the best idea. – Leb Nov 05 '15 at 15:14
  • http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html – Lee Nov 05 '15 at 15:15
  • @atomh33ls: I don't have data to estimate the covariance from. I am given certain rules on how to construct the covariance matrix. If you are familiar with Gaussian processes, the covariance matrix of Brownian motion, for example, has entry (i,j) take the value min(i,j). I have rules in that vein. – Ben Nov 05 '15 at 15:21
  • It would be very helpful (if not necessary) to explain the procedure for calculating your covariance matrix. It might be very possible to not iterate explicitly at all, which would be the best/fastest & most 'pythonic' way. – DilithiumMatrix Nov 05 '15 at 15:42
  • @Leb - This is actually the perfect type of problem to jump into scientific python with, i.m.o. For people coming from Matlab, jumping directly into numpy, etc. really is what you want. – Joe Kington Nov 05 '15 at 16:14
  • In many ways you should think of learning numpy and learning Python as being quite different things, since numpy has its own distinct set of best practices. The trick to getting good performance is to use as little Python as possible. That means avoiding low-level Python constructs such as `for` loops and list comprehensions by using clever array indexing and vectorized operations instead. By doing that you are effectively doing your looping in C rather than Python. – ali_m Nov 05 '15 at 16:15
  • It's hard to recommend anything specific in your case without a concrete example, but to give you a flavour you could look at [how numpy computes covariance](https://github.com/numpy/numpy/blob/bf28b4432183126c21f0fa80852c335e9c1ed7c1/numpy/lib/function_base.py#L2284-L2291). – ali_m Nov 05 '15 at 16:17
  • It is easy enough to compute the covariance matrix in a mathematically correct way just using raw Python -- but doing so both quickly and with minimal round-off error is harder. It makes more sense to use something like `numpy` or `pandas` so you don't find yourself reinventing the wheel -- especially if the wheel that you invent is almost guaranteed to be suboptimal. – John Coleman Nov 05 '15 at 16:24
  • @ali_m I have edited my original post and included an example of the kind of problem I face. – Ben Nov 05 '15 at 16:51

1 Answer


First off, for others who may come across this question in the future: If you did have data and wanted to estimate a covariance matrix from it, as several people have noted, use np.cov or something similar.
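For example (with made-up data; by default `np.cov` treats each row as a variable and each column as an observation):

import numpy as np

# Made-up data: 3 variables, 1000 observations each
data = np.random.random((3, 1000))

# Estimate the 3 x 3 covariance matrix from the data
est_cov = np.cov(data)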

Building Arrays From Patterns

However, your question is about how to build a large matrix given some pre-defined rules. To clear up some confusion in the comments: your question doesn't seem to be about estimating a covariance matrix; it's about specifying one. In other words, you're asking how to build up a large array according to those rules.

Which way is most efficient is going to depend on what you're doing in detail. Most performance tricks in this case will involve exploiting symmetry in the calculation you're performing. (For example, will one row be identical to another, or do you only need one triangle of a symmetric matrix? See the sketch below.)
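For the symmetric case (a minimal sketch, assuming a hypothetical, relatively expensive entry rule `calculate_value(i, j)`), you could fill only the lower triangle and mirror it:

import numpy as np

n = 100
cov = np.zeros((n, n), dtype=float)

# Fill only the lower triangle (including the diagonal)...
for i, j in zip(*np.tril_indices(n)):
    cov[i, j] = calculate_value(i, j)  # hypothetical entry rule

# ...then mirror it into the upper triangle, subtracting one copy
# of the diagonal so it isn't counted twice
cov = cov + cov.T - np.diag(np.diag(cov))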

It's hard to say anything specific without knowing exactly what you're doing. Therefore, I'll focus on how to do this type of thing in general. (Note: I just noticed your edit. I'll include an example for a Brownian Bridge in just a bit...)

Constant (or simple) Row/Column

The most basic case is a constant row or column in the output array. It's easy to create the array and assign values to a column or row using slicing syntax:

import numpy as np

num_vars = 10**4
cov = np.zeros((num_vars, num_vars), dtype=float)

To set an entire column/row:

# Third column will be all 9's
cov[:,2] = 9

# Second row will be all 1's (will overwrite the 9 in col3)
cov[1,:] = 1

You can also assign arrays to columns/rows:

# 5th row will have random values
cov[4,:] = np.random.random(num_vars)

# 6th row will have a simple quadratic sequence
cov[5,:] = np.arange(num_vars)**2

Stacking Arrays

In many cases (but probably not this exact case), you'll want to build up your output from existing arrays. You can use vstack/hstack/column_stack/tile and many other similar functions for this.
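As a quick illustration (a minimal sketch with made-up blocks), you could assemble a larger matrix from smaller pieces:

import numpy as np

# Two made-up 2 x 2 blocks
a = np.ones((2, 2))
b = np.full((2, 2), 2.0)

# Assemble a 4 x 4 block matrix [[a, b], [b, a]]
block = np.vstack([np.hstack([a, b]),
                   np.hstack([b, a])])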

A good example is if we're setting up a matrix for a linear inversion of a polynomial:

import numpy as np

num = 10
x = np.random.random(num) # Observation locations

# "Green's functions" for a second-order polynomial
# at our observed locations
A = np.column_stack([x**i for i in range(3)])

However, this will build up several temporary arrays (three, in this case). If we were working with a 10000-dimensional polynomial and 10^6 observations, the approach above would use too much RAM. Therefore, you might iterate over columns instead:

ndim = 2
A = np.zeros((x.size, ndim + 1), dtype=float)
for j in range(ndim + 1):
    A[:,j] = x**j

In most cases, don't worry about the temporary arrays. The column_stack-based example is the right way to go unless you're working with relatively large arrays.

The most general approach

Without any more information, we can't exploit any sort of symmetry. The most general way is to just iterate through. Typically you'll want to avoid this approach, but sometimes it's unavoidable (especially if the calculation depends on a previous value).

Speed-wise this is identical to nested for loops, but it's easier (especially for >2D arrays) to use np.ndindex instead of multiple for loops:

import numpy as np

num_vars = 10**4
cov = np.zeros((num_vars, num_vars), dtype=float)
for i, j in np.ndindex(cov.shape):
    # Logic presumably in some function...
    cov[i, j] = calculate_value(i, j)

Vectorizing Index-based Calculations

In many cases, you can vectorize index-based calculations. In other words, operate directly on arrays of the indices of your output.

Let's say we had code that looked like:

import numpy as np

cov = np.zeros((10, 10), dtype=float)
for i, j in np.ndindex(cov.shape):
    cov[i,j] = i*j - i

We could replace that with:

i, j = np.mgrid[:10, :10]
cov = i*j - i

As another example, let's build up a 100 x 100 "inverted cone" of values:

# The complex numbers in "mgrid" give the number of increments
# mgrid[min:max:num*1j, min:max:num*1j] is similar to
# meshgrid(linspace(min, max, num), linspace(min, max, num))
y, x = np.mgrid[-5:5:100j, -5:5:100j]

# Our "inverted cone" is just the distance from 0
r = np.hypot(x, y)

Brownian Bridge

This is a good example of something that can be easily vectorized. If I'm reading your example correctly, you'd want something similar to:

import numpy as np

# Grids of s and t values for s, t = 1, ..., 100
st = np.mgrid[1:101, 1:101]
s, t = st

# cov(X_s, X_t) = min(s, t) - s*t, evaluated on the whole grid at once
cov = st.min(axis=0) - s * t
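As a quick sanity check (just a sketch), you can spot-check a few entries against the formula; note that cov[i, j] corresponds to s = i + 1, t = j + 1:

# cov(X_s, X_t) = min(s, t) - s*t
assert cov[0, 0] == min(1, 1) - 1 * 1
assert cov[4, 9] == min(5, 10) - 5 * 10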

Overall, I've only touched on a few general patterns. However, hopefully this gets you pointed in the right direction.
