I am currently attending a lecture on automatic speech recognition (ASR). The last lecture was about vector quantization (VQ) and k-nearest neighbors (kNN), as well as binary trees and Gaussian mixture models (GMMs).

According to the lecturer, VQ is used to speed up the evaluation of GMMs by computing only an approximate value of the GMM. This is done by finding the Gaussian in a GMM that would have the highest value and looking this value up (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians. According to the lecturer, this should speed the calculation up, because the calculation of the e-function (exp, the natural exponential function) is computationally expensive.
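
For concreteness, this is how I understand the scheme. The sketch below is my own illustration, not the lecturer's code: it assumes diagonal-covariance Gaussians, uses NumPy, and replaces the binary-tree codeword search with a brute-force nearest-neighbor search; all sizes except the 42 components are made up.

import numpy as np

# Illustrative sketch of VQ-accelerated GMM scoring (assumed details).
rng = np.random.default_rng(0)
dim, n_components, n_codewords = 40, 42, 256

means = rng.normal(size=(n_components, dim))
inv_vars = np.ones((n_components, dim))             # diagonal covariances
log_weights = np.full(n_components, -np.log(n_components))

# Codebook built offline (k-means in a real system; random here).
codebook = rng.normal(size=(n_codewords, dim))

def component_log_likelihoods(x):
    # Log-likelihood of x under each Gaussian, up to a constant.
    d = x - means
    return log_weights - 0.5 * np.sum(d * d * inv_vars, axis=1)

# Offline: for each codeword, store the value of its best-scoring Gaussian.
table = np.array([component_log_likelihoods(c).max() for c in codebook])

def approx_gmm_log_likelihood(x):
    # Online: quantize x (a binary tree in a real system) and look up.
    idx = np.argmin(np.sum((codebook - x) ** 2, axis=1))
    return table[idx]

The point is that the online path contains no exp at all: the expensive per-Gaussian work has been moved offline into the table.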

I was curious whether this is (still) true, searched for the Python implementation, and found this answer, which explains that exp is calculated by the hardware.

Today's CPUs (and GPUs) are complex, and I have very limited knowledge of them. It could still be true that exp is much more expensive than, e.g., comparisons of floats, additions, or multiplications.

Questions

  • How expensive is exp in comparison to float comparisons, additions, multiplications, and similar basic operations?
  • Did I possibly misunderstand why VQ is done in ASR?

Experimental evaluation

I tried to get a result by running an experiment. But it is difficult for me to keep other effects from skewing my numbers (e.g. caches, variable lookup times, the cost of the random number generator, ...).

Currently, I have

#!/usr/bin/env python

import math
import time
import random

# Experiment settings
numbers = 5000000
seed = 0
repetitions = 10

# Experiment
random.seed(seed)
values = [random.uniform(-5, 5) for _ in range(numbers)]
v2 = [random.uniform(-5, 5) for _ in range(numbers)]

# Exp
for i in range(repetitions):
    t0 = time.time()
    ret = [math.exp(x) for x in values]
    t1 = time.time()
    time_delta = t1 - t0
    print("Exp time: %0.4fs (%0.4f per second)" % (time_delta, numbers/time_delta))

# Addition
for i in range(repetitions):
    t0 = time.time()
    ret = [x+y for x, y in zip(values, v2)]
    t1 = time.time()
    time_delta = t1 - t0
    print("x+y time: %0.4fs (%0.4f per second)" % (time_delta, numbers/time_delta))

But I guess the zip call makes this comparison unfair, because the result is:

Exp time: 1.3640s (3665573.5997 per second)
Exp time: 1.7404s (2872978.6149 per second)
Exp time: 1.5441s (3238178.6480 per second)
Exp time: 1.5161s (3297876.5227 per second)
Exp time: 1.9912s (2511009.5658 per second)
Exp time: 1.3086s (3820818.9478 per second)
Exp time: 1.4770s (3385254.5642 per second)
Exp time: 1.5179s (3294040.1828 per second)
Exp time: 1.3198s (3788392.1744 per second)
Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)
x+y time: 2.2017s (2270981.5582 per second)
x+y time: 2.0781s (2406097.0233 per second)
x+y time: 2.1386s (2338005.6240 per second)
x+y time: 1.9963s (2504681.1570 per second)
x+y time: 2.1617s (2313042.3523 per second)
x+y time: 2.3166s (2158293.4313 per second)
x+y time: 2.2966s (2177155.9497 per second)
x+y time: 2.2939s (2179730.8867 per second)
x+y time: 2.3094s (2165055.9488 per second)
– Martin Thoma

  • I do not know specifics about GMM or VQ, but it looks like your lecturer has done a good job if he/she made you check this out. Anyway, I found that, for example, in CUDA exp is implemented in the CUDA math library and not (directly) in hardware. Also, [this](http://research.colfaxinternational.com/file.axd?file=2012/4/Colfax_FLOPS.pdf) states that Sandy Bridge CPUs also do not implement exp in hardware – Thomas Dec 03 '14 at 20:22
  • Your benchmark is dominated by Python interpreter overhead from `+` having 2 operands but `exp` only having 1. `exp` is significantly slower than floating point `+`. Also, Martijn's answer you linked is unfortunately wrong about `exp` being implemented directly in hardware; that's not the case on most CPU ISAs. Even legacy x87 (which some 32-bit x86 math libraries might or might not use) only has microcoded [`f2xm1`](https://www.felixcloutier.com/x86/f2xm1), not dedicated hardware on any real x86 CPUs. – Peter Cordes May 22 '20 at 23:31
  • Fast approximations to `exp` are very possible, though, thanks to binary floating point being based on `mantissa * 2^exponent`; you can stuff the integer part of the arg into the exponent field in a float (with integer bit-shifts) and do a quick polynomial approximation on the fractional part. Especially if you don't need full `float` precision this can be very fast, like a 3rd order polynomial is pretty good and only takes about 3 FMAs. – Peter Cordes May 22 '20 at 23:32
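
For illustration, here is a minimal sketch of the trick Peter Cordes describes, assuming IEEE-754 doubles: the integer part of x/ln(2) is written into the exponent field with integer shifts, and 2**f on the fractional part is approximated by a truncated Taylor polynomial (the 3rd-order version is accurate to well under 1%; a fitted polynomial would do better). It handles no denormals or overflow and is only meant to show the bit manipulation:

import math
import struct

def fast_exp(x):
    # exp(x) = 2**(x/ln 2) = 2**n * 2**f, with n an integer and f in [0, 1)
    y = x * 1.4426950408889634        # x / ln(2)
    n = math.floor(y)                 # integer part -> exponent field
    f = y - n                         # fractional part in [0, 1)
    # 3rd-order Taylor polynomial for 2**f = e**(f*ln 2)
    p = 1.0 + f * (0.6931471805599453
             + f * (0.2402265069591007
             + f * 0.0555041086648216))
    # Build 2**n by writing n + 1023 into the exponent bits of a double.
    two_n = struct.unpack('<d', struct.pack('<q', (n + 1023) << 52))[0]
    return two_n * p

For example, fast_exp(1.0) gives about 2.71749 versus math.exp(1.0) ≈ 2.71828. In production this would be vectorized C with FMAs; the Python version only shows the idea.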

1 Answer


According to the lecturer, VQ is used to speed up the evaluation of GMMs by computing only an approximate value of the GMM. This is done by finding the Gaussian in a GMM that would have the highest value and looking this value up (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians.

This is a correct description. You can find an interesting description of optimized Gaussian computation in the following paper:

George Saon, Daniel Povey & Geoffrey Zweig, "Anatomy of an extremely fast LVCSR decoder," Interspeech 2005. http://www.danielpovey.com/files/eurospeech05_george_decoder.pdf

See in particular the likelihood computation section.

According to the lecturer, this should speed the calculation up, because the calculation of the e-function (exp, the natural exponential function) is computationally expensive.

At this point you probably misunderstood the lecturer. The exp is not a very significant issue. The Gaussian computation is expensive for other reasons: several thousand GMMs are scored every frame, each with a few dozen Gaussian components, and each component is a vector of 40 floats. It is expensive to process all this data because of the amount of memory you have to fetch and store. Gaussian selection helps here by reducing the number of Gaussians severalfold, and thus speeds up the computation.
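
A rough back-of-envelope calculation makes the memory argument concrete. The 4000 and 32 below are assumed stand-ins for "several thousand" and "a few dozen"; only the dimension of 40 comes from the answer:

# Back-of-envelope for the data volume of naive GMM scoring.
n_gmms = 4000             # GMMs scored per frame (assumed)
n_components = 32         # Gaussian components per GMM (assumed)
dim = 40                  # feature dimension (from the answer)
bytes_per_float = 4
frames_per_second = 100   # typical 10 ms frame shift

bytes_per_frame = n_gmms * n_components * dim * bytes_per_float
print("per frame:  %.1f MB" % (bytes_per_frame / 1e6))                       # ~20.5 MB
print("per second: %.2f GB" % (bytes_per_frame * frames_per_second / 1e9))   # ~2.05 GB

Streaming gigabytes per second through the cache hierarchy dwarfs the cost of the exp calls themselves.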

Using a GPU is another solution to this problem. By moving the scoring to the GPU you can speed it up significantly. However, there is an issue with the HMM search in that it cannot be easily parallelized. The search is another important part of the decoding, and even if you reduce the scoring cost to zero, the decoding will still be slow because of the search.

Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)

This is not a meaningful comparison. There are many things you ignored here, like the cost of the Python zip call (itertools.izip is better in Python 2). Set up this way, you can easily demonstrate any result you like.
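
A fairer micro-benchmark gives both operations the same loop shape and the same number of operands, so that most of the interpreter overhead cancels. A minimal sketch using timeit (list size and repetition count are arbitrary choices):

import timeit

setup = ("import math, random; random.seed(0); "
         "xs = [random.uniform(-5, 5) for _ in range(100000)]")

# Identical list-comprehension shape, one operand per operation.
t_exp = timeit.timeit("[math.exp(x) for x in xs]", setup=setup, number=10)
t_add = timeit.timeit("[x + 1.0 for x in xs]", setup=setup, number=10)
print("exp(x): %.4fs" % t_exp)
print("x+1.0:  %.4fs" % t_add)

Even this still measures Python call overhead on top of the hardware cost, but at least both sides pay the same overhead.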

– Nikolay Shmyrev