I am trying to implement AES-256 in CTR mode using NVIDIA CUDA. I have working CPU code for the key expansion, and now I need to implement the actual AES-256 algorithm. According to Wikipedia, some code I've seen, and particularly this PDF (page 9), AES rounds can be implemented as a series of table lookups. My question is: how do I generate these tables? I am aware that I need 4 KB to store them, and that is not a problem. I have spent a whole day trying to find these tables with no success. The PDF I linked mentions lookup tables T0, T1, T2 and T3, but I do not know what these are. It also mentions round keys 4, 5, 6 and 7, but I do not understand what these indices refer to.
The closest I have come to figuring out how to generate these lookup tables is from this project. Inside the code, there is a comment that says:
Te0[x] = S [x].[02, 01, 01, 03];
Te1[x] = S [x].[03, 02, 01, 01];
Te2[x] = S [x].[01, 03, 02, 01];
Te3[x] = S [x].[01, 01, 03, 02];
However, I'm not entirely sure what that notation means (is it a matrix multiplication or something else?). The only things I recognize are the constant matrix from the MixColumns step and the S-box.
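For reference, here is a minimal sketch of one reading of that comment. It assumes `S[x].[02, 01, 01, 03]` means "multiply the S-box output by each coefficient in GF(2^8) and pack the four result bytes into a 32-bit word, most significant byte first". All function and variable names apart from `Te0`..`Te3` are hypothetical and not taken from the project. The S-box is computed rather than hard-coded, so the sketch stays short.

```c
#include <stdint.h>

/* Hypothetical names; byte order is an assumption (most significant byte first). */
uint8_t  aes_sbox[256];
uint32_t Te0[256], Te1[256], Te2[256], Te3[256];

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rotl8(uint8_t v, int n)
{
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* S-box: multiplicative inverse (0 maps to 0), then the affine transform. */
static void aes_build_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        /* Brute-force search for the inverse; fine for a one-time setup. */
        for (int y = 1; y < 256 && x != 0; y++) {
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) {
                inv = (uint8_t)y;
                break;
            }
        }
        aes_sbox[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                                rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

/* Fill Te0..Te3 following the coefficient order in the quoted comment. */
void aes_build_tables(void)
{
    aes_build_sbox();
    for (int x = 0; x < 256; x++) {
        uint32_t s  = aes_sbox[x];            /* 01 * S[x] */
        uint32_t s2 = gf_mul(aes_sbox[x], 2); /* 02 * S[x] */
        uint32_t s3 = gf_mul(aes_sbox[x], 3); /* 03 * S[x] */
        Te0[x] = (s2 << 24) | (s  << 16) | (s  << 8) | s3;
        Te1[x] = (s3 << 24) | (s2 << 16) | (s  << 8) | s;
        Te2[x] = (s  << 24) | (s3 << 16) | (s2 << 8) | s;
        Te3[x] = (s  << 24) | (s  << 16) | (s3 << 8) | s2;
    }
}

/* One output column of a normal round: four lookups XORed with one
 * round-key word. c0..c3 are state columns j, j+1, j+2, j+3 (mod 4),
 * which accounts for ShiftRows. */
uint32_t aes_round_column(uint32_t c0, uint32_t c1, uint32_t c2,
                          uint32_t c3, uint32_t rk_word)
{
    return Te0[c0 >> 24] ^ Te1[(c1 >> 16) & 0xff] ^
           Te2[(c2 >> 8) & 0xff] ^ Te3[c3 & 0xff] ^ rk_word;
}
```

Under this reading, each table is a byte rotation of the previous one, and the four tables together take 4 x 256 x 4 bytes = 4 KB. If the expanded key is stored as an array of 32-bit words, indices such as 4, 5, 6 and 7 would plausibly be the four words used in round 1 (words 0 to 3 being consumed by the initial AddRoundKey), but I have not confirmed that this is what the PDF means.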
[Edit] Now that someone has pointed it out: how can a lookup-table implementation actually be slower? Would it be wise to implement AES without lookup tables here?