IMO, there are several wrong assumptions in the question, but it is interesting anyway.
The calculation of the theoretical RAM speed proposed in the question seems to overlook multi-channel architectures. I would use the following formula:
Max transfer rate = clock frequency * transfers per clock * interface width * number of interfaces
With the interface width expressed in bits, this gives bits/s; divide by 8 to get the result in bytes/s.
In your example, clock frequency = 667 MHz, transfers per clock = 2 (because DDR3-1333 is double data rate memory), interface width = 64 bits, and the number of interfaces depends on your motherboard and the number of plugged memory modules. Most recent PCs provide 2 channels. Recent servers provide 3 or 4 channels. The number of interfaces is min(number of modules per CPU, number of channels).
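As a quick check of the arithmetic, here is a minimal Python sketch of the formula above. The function name and parameter names are mine, chosen only for illustration; the input values are the ones from the example.

```python
def peak_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, width_bits, channels):
    """Theoretical burst rate in bytes/s for the given memory configuration."""
    # clock * transfers per clock * interface width (bits) * number of interfaces
    bits_per_s = clock_hz * transfers_per_clock * width_bits * channels
    # Divide by 8 to convert bits/s to bytes/s.
    return bits_per_s / 8


# 667 MHz clock, 2 transfers per clock, 64-bit interface, 1 to 4 channels.
for channels in (1, 2, 3, 4):
    rate = peak_bandwidth_bytes_per_s(667e6, 2, 64, channels)
    print(f"{channels} channel(s): {rate / 1e9:.2f} GB/s")
```

With these inputs a single channel comes out at roughly 10.7 GB/s, and the figure scales linearly with the number of populated channels.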
Some information about the burst rate of DDR3 memory:
http://en.wikipedia.org/wiki/DDR3_SDRAM
Now, you have to keep in mind that this bandwidth corresponds to a theoretical burst rate, generally only sustainable for brief periods of time. Furthermore, it only describes the capabilities of the memory modules; it says nothing about the front side bus or the CPU memory controllers. In other words, even with very fast memory modules, a slow CPU may not be able to saturate the memory bandwidth. Bottlenecks are not always in the memory modules.
On ccNUMA machines (most servers with 2 or 4 sockets), if a CPU core needs to access a block located in a memory bank attached to another CPU, the interconnect bus (QPI or HyperTransport) will be used. This bus can also be a bottleneck.
Finally, I think the methodology of the test (using dd) is flawed, because:
It does not exercise only memory transfers, because dd uses the filesystem interface. Even assuming that the resulting file is hosted in a memory filesystem (such as tmpfs or /dev/shm), dd will make system calls to perform the operation, which brings additional costs.
dd is a single-threaded process. A single core may not be enough to saturate the whole memory bandwidth. On a server with multiple sockets, a single core will certainly not be enough. On a single-socket system, I guess it depends on the CPU itself.
If you really want to evaluate the actual memory bandwidth and compare it to the theoretical limit, I would suggest using a benchmark program designed for this purpose. For instance, the STREAM benchmark is often used to measure the sustainable memory bandwidth.
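To give an idea of what such a benchmark measures, here is a rough single-threaded sketch in Python. It is not STREAM and not a substitute for it; it only times a large in-memory copy, loosely in the spirit of a copy kernel. The function name, buffer size and repeat count are arbitrary choices of mine, and because it runs on one core, the caveat about a single core not saturating the memory bus still applies.

```python
import time


def copy_bandwidth_bytes_per_s(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate copy bandwidth by timing a large buffer-to-buffer copy."""
    # Fill the source with non-zero data so its pages are really allocated.
    src = bytearray(b"\x01") * size_bytes
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer, no per-byte Python loop
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from the timer on very small buffers.
    best = max(best, 1e-9)
    # Each copy reads size_bytes and writes size_bytes.
    return 2 * size_bytes / best


if __name__ == "__main__":
    rate = copy_bandwidth_bytes_per_s()
    print(f"approximate copy bandwidth: {rate / 1e9:.2f} GB/s")
```

The buffer should be much larger than the CPU caches, otherwise the number reflects cache speed rather than RAM. Taking the best of several runs discards the first pass, which also pays for page faults on the destination buffer.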