This may not be the answer you're looking for, but one way to speed it up considerably is to use a GPU instead of your CPU. If you have a reasonably powerful graphics card around, it will outperform your CPU any day, even if your system is very well tuned.
For nice integration with NumPy, you could use Theano (if your graphics card is made by NVIDIA). The calculation in the following code runs for me in a couple of seconds (although I have a very powerful graphics card):
$ THEANO_FLAGS=device=gpu0 python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
Using gpu device 0: GeForce GTX 480
>>> from theano import tensor as T
>>> import numpy
>>> x = numpy.ones((200000, 1000), dtype=numpy.float32)
>>> m = T.matrix()
>>> mTm = T.dot(m.T, m)
>>> f = theano.function([m], mTm)
>>> f(x)
array([[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
...,
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.]], dtype=float32)
>>> r = f(x)
>>> r.shape
(1000, 1000)
I was going to find out how long

>>> numpy.dot(x.T, x)

took on the CPU by way of comparison, but I got bored waiting...
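If you want a rough CPU baseline without the wait, one option is to time the same product on a scaled-down matrix (the 2000-row size here is just an illustrative choice, not from the session above):

```python
import time
import numpy

# Scaled-down version of the matrix above: 2000 x 1000 instead of 200000 x 1000
x = numpy.ones((2000, 1000), dtype=numpy.float32)

start = time.time()
r = numpy.dot(x.T, x)       # (1000 x 2000) . (2000 x 1000) -> (1000 x 1000)
elapsed = time.time() - start

print(r.shape)              # (1000, 1000)
print(r[0][0])              # 2000.0 -- each entry is a sum of 2000 ones
print(elapsed)              # wall-clock seconds for the CPU dot product
```

Since the matrix product scales linearly in the number of rows, multiplying the measured time by 100 gives a crude estimate for the full 200000-row case.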
You can also try PyCUDA, or PyOpenCL if you don't have an NVIDIA graphics card, although I don't know whether their NumPy support is as straightforward.