
I read the paper 'Rethinking Attention with Performers'. It is a seminal contribution to reducing the quadratic time complexity of self-attention in Transformers, with strong theoretical guarantees. However, I am stuck on the following equation (Equation 5 in the paper), which approximates a non-linear shift-invariant kernel.

Equation 5 defines the random feature map (for $\omega_1, \dots, \omega_m \overset{iid}{\sim} \mathcal{D}$ and deterministic functions $f_1, \dots, f_l : \mathbb{R} \to \mathbb{R}$, $h : \mathbb{R}^d \to \mathbb{R}$):

$$\phi(\mathbf{x}) = \frac{h(\mathbf{x})}{\sqrt{m}} \left( f_1(\omega_1^\top \mathbf{x}), \dots, f_1(\omega_m^\top \mathbf{x}), \dots, f_l(\omega_1^\top \mathbf{x}), \dots, f_l(\omega_m^\top \mathbf{x}) \right)$$
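To make sure I am reading Equation 5 correctly, here is a small numerical sketch of my own. It assumes the positive-feature instantiation $l = 1$, $f_1 = \exp$, $h(\mathbf{x}) = \exp(-\|\mathbf{x}\|^2/2)$, $\omega_i \sim \mathcal{N}(0, I_d)$, which (as I understand it) the paper uses to approximate the softmax kernel $\mathrm{SM}(\mathbf{x},\mathbf{y}) = \exp(\mathbf{x}^\top\mathbf{y})$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 100_000                      # input dim, number of random features
x = 0.2 * rng.normal(size=d)
y = 0.2 * rng.normal(size=d)

# Eq. 5 instantiated with l = 1, f1 = exp, h(x) = exp(-||x||^2 / 2),
# omega_i ~ N(0, I_d): positive random features for the softmax kernel.
W = rng.normal(size=(m, d))            # rows are omega_1, ..., omega_m

def phi(v):
    return np.exp(-0.5 * v @ v) / np.sqrt(m) * np.exp(W @ v)

approx = phi(x) @ phi(y)               # phi(x)^T phi(y)
exact = np.exp(x @ y)                  # SM(x, y) = exp(x^T y)
print(approx, exact)                   # should agree closely for large m
```

Empirically the dot product of the feature maps does track $\exp(\mathbf{x}^\top\mathbf{y})$, so the map itself seems right; my difficulty is with proving it.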

I am unable to prove the above equation, in particular the role of the deterministic function $h(\mathbf{x})$. However, the literature (both works cited in the paper and others) uses the following form of the random Fourier feature map, and this $\phi$ function does not contain the deterministic function $h(\mathbf{x})$:

$$\phi(\mathbf{x}) = \sqrt{\frac{2}{m}} \left( \cos(\omega_1^\top \mathbf{x} + b_1), \dots, \cos(\omega_m^\top \mathbf{x} + b_m) \right), \quad \omega_i \sim p(\omega), \; b_i \sim \mathrm{Unif}[0, 2\pi]$$
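For reference, the classical construction above can be checked numerically. This sketch (mine, assuming the Gaussian kernel $K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x}-\mathbf{y}\|^2/2)$ with spectral distribution $p(\omega) = \mathcal{N}(0, I_d)$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 20_000                       # input dim, number of random features
x = 0.3 * rng.normal(size=d)
y = 0.3 * rng.normal(size=d)

# Standard RFF map for the Gaussian kernel K(x, y) = exp(-||x - y||^2 / 2):
# phi(x) = sqrt(2/m) * cos(W x + b), W with N(0, I) rows, b ~ Unif[0, 2*pi].
W = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def phi(v):
    return np.sqrt(2.0 / m) * np.cos(W @ v + b)

approx = phi(x) @ phi(y)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))
print(approx, exact)                   # should be close for large m
```

So the plain RFF map approximates a shift-invariant kernel with no $h(\mathbf{x})$ factor at all, which is exactly what confuses me about Equation 5.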

(This form appears, for example, in Rahimi and Recht's 'Random Features for Large-Scale Kernel Machines'.)

It seems that Equation 5 is a generalization of the above equation, but I am unable to derive Equation 5 from it. Could someone help me obtain Equation 5 of 'Rethinking Attention with Performers'?
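Here is how far I can get on my own (assuming $l = 2$, $f_1 = \cos$, $f_2 = \sin$, and $h \equiv 1$ in Equation 5):

$$\phi(\mathbf{x})^\top \phi(\mathbf{y}) = \frac{1}{m}\sum_{i=1}^{m} \left[ \cos(\omega_i^\top \mathbf{x})\cos(\omega_i^\top \mathbf{y}) + \sin(\omega_i^\top \mathbf{x})\sin(\omega_i^\top \mathbf{y}) \right] = \frac{1}{m}\sum_{i=1}^{m} \cos\big(\omega_i^\top(\mathbf{x}-\mathbf{y})\big),$$

which by Bochner's theorem is an unbiased estimator of the shift-invariant kernel $K(\mathbf{x}-\mathbf{y})$, matching the random Fourier feature construction above. What I cannot see is how to justify the extra multiplicative factor $h(\mathbf{x})$ in the general form.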

There is also a blog post on the work titled 'Random Features for Large-Scale Kernel Machines' that gives a good introduction to low-dimensional approximations of kernel functions.

James Z

0 Answers