
I read the paper 'Rethinking Attention with Performers'. It is a seminal contribution to reducing the quadratic time complexity of self-attention in Transformers, with strong theoretical guarantees. However, I am stuck on the following equation (Equation 5 in the paper), which approximates a non-linear shift-invariant kernel.

Equation 5 defines the random feature map (for $\omega_1, \dots, \omega_m \overset{iid}{\sim} \mathcal{D}$ and deterministic functions $f_1, \dots, f_l : \mathbb{R} \to \mathbb{R}$, $h : \mathbb{R}^d \to \mathbb{R}$):

$$\phi(\mathbf{x}) = \frac{h(\mathbf{x})}{\sqrt{m}} \left( f_1(\omega_1^\top \mathbf{x}), \dots, f_1(\omega_m^\top \mathbf{x}), \dots, f_l(\omega_1^\top \mathbf{x}), \dots, f_l(\omega_m^\top \mathbf{x}) \right)$$
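To make sure I am reading Equation 5 correctly, here is a small numerical sketch of my own. It assumes the positive-feature instantiation $l = 1$, $f_1 = \exp$, $h(\mathbf{x}) = \exp(-\|\mathbf{x}\|^2/2)$, $\omega_i \sim \mathcal{N}(0, I_d)$, which (as I understand it) the paper uses to approximate the softmax kernel $\mathrm{SM}(\mathbf{x},\mathbf{y}) = \exp(\mathbf{x}^\top\mathbf{y})$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 100_000                      # input dim, number of random features
x = 0.2 * rng.normal(size=d)
y = 0.2 * rng.normal(size=d)

# Eq. 5 instantiated with l = 1, f1 = exp, h(x) = exp(-||x||^2 / 2),
# omega_i ~ N(0, I_d): positive random features for the softmax kernel.
W = rng.normal(size=(m, d))            # rows are omega_1, ..., omega_m

def phi(v):
    return np.exp(-0.5 * v @ v) / np.sqrt(m) * np.exp(W @ v)

approx = phi(x) @ phi(y)               # phi(x)^T phi(y)
exact = np.exp(x @ y)                  # SM(x, y) = exp(x^T y)
print(approx, exact)                   # should agree closely for large m
```

Empirically the dot product of the feature maps does track $\exp(\mathbf{x}^\top\mathbf{y})$, so the map itself seems right; my difficulty is with proving it.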

I am unable to prove the above equation, in particular the role of the deterministic function $h(\mathbf{x})$. However, the literature (both works cited in the paper and others) uses the following form of the random Fourier feature map, and this $\phi$ function does not contain the deterministic function $h(\mathbf{x})$:

$$\phi(\mathbf{x}) = \sqrt{\frac{2}{m}} \left( \cos(\omega_1^\top \mathbf{x} + b_1), \dots, \cos(\omega_m^\top \mathbf{x} + b_m) \right), \quad \omega_i \sim p(\omega), \; b_i \sim \mathrm{Unif}[0, 2\pi]$$
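For reference, the classical construction above can be checked numerically. This sketch (mine, assuming the Gaussian kernel $K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x}-\mathbf{y}\|^2/2)$ with spectral distribution $p(\omega) = \mathcal{N}(0, I_d)$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 20_000                       # input dim, number of random features
x = 0.3 * rng.normal(size=d)
y = 0.3 * rng.normal(size=d)

# Standard RFF map for the Gaussian kernel K(x, y) = exp(-||x - y||^2 / 2):
# phi(x) = sqrt(2/m) * cos(W x + b), W with N(0, I) rows, b ~ Unif[0, 2*pi].
W = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def phi(v):
    return np.sqrt(2.0 / m) * np.cos(W @ v + b)

approx = phi(x) @ phi(y)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))
print(approx, exact)                   # should be close for large m
```

So the plain RFF map approximates a shift-invariant kernel with no $h(\mathbf{x})$ factor at all, which is exactly what confuses me about Equation 5.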

(This form appears, for example, in Rahimi and Recht's 'Random Features for Large-Scale Kernel Machines'.)

It seems that Equation 5 is a generalization of the above equation, but I am unable to derive Equation 5 from it. Could someone help me obtain Equation 5 of 'Rethinking Attention with Performers'?
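Here is how far I can get on my own (assuming $l = 2$, $f_1 = \cos$, $f_2 = \sin$, and $h \equiv 1$ in Equation 5):

$$\phi(\mathbf{x})^\top \phi(\mathbf{y}) = \frac{1}{m}\sum_{i=1}^{m} \left[ \cos(\omega_i^\top \mathbf{x})\cos(\omega_i^\top \mathbf{y}) + \sin(\omega_i^\top \mathbf{x})\sin(\omega_i^\top \mathbf{y}) \right] = \frac{1}{m}\sum_{i=1}^{m} \cos\big(\omega_i^\top(\mathbf{x}-\mathbf{y})\big),$$

which by Bochner's theorem is an unbiased estimator of the shift-invariant kernel $K(\mathbf{x}-\mathbf{y})$, matching the random Fourier feature construction above. What I cannot see is how to justify the extra multiplicative factor $h(\mathbf{x})$ in the general form.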

There is also a blog post on the work titled 'Random Features for Large-Scale Kernel Machines' that gives a good introduction to low-dimensional approximations of kernel functions.

James Z

0 Answers