Questions tagged [self-attention]

57 questions
0
votes
0 answers

Tensorflow Attention ValueError: Dimension must be 5 but is 4

I am trying to follow the below code for a self-attention model. The self-attention networks have 16 heads, and the output of each head is 16-dimensional. The dimension of the additive attention query vectors is 200. def __init__(self, nb_head,…
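A minimal sketch of the setup described above (not the asker's custom layer): 16 heads of 16 dimensions each via the built-in MultiHeadAttention, followed by additive attention pooling with a 200-dimensional query space. The sequence length (30) and embedding size (300) are placeholder assumptions.

    import tensorflow as tf

    tokens = tf.keras.Input(shape=(30, 300))                  # 30 tokens, 300-dim embeddings (placeholders)
    ctx = tf.keras.layers.MultiHeadAttention(
        num_heads=16, key_dim=16, output_shape=256)(tokens, tokens)   # 16 heads x 16 dims = 256
    scores = tf.keras.layers.Dense(1)(
        tf.keras.layers.Dense(200, activation="tanh")(ctx))           # 200-dim additive attention query space
    weights = tf.keras.layers.Softmax(axis=1)(scores)                 # attention over the 30 tokens
    pooled = tf.keras.layers.Flatten()(
        tf.keras.layers.Dot(axes=(1, 1))([weights, ctx]))             # weighted sum -> (batch, 256)
    model = tf.keras.Model(tokens, pooled)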
0
votes
0 answers

Using Self-Attention Layer in Keras without Encoder-Decoder

To my knowledge, attention has been used with encoder-decoder models. I am trying to use it as a layer in a feedforward neural network. I have the following architecture: Input layer -> Dense Layer -> Self-Attention Layer -> Dense Layer -> SoftMax…
Avv • 429 • 4 • 17
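A hedged sketch of the architecture in the question above, assuming the input is a sequence of feature vectors (self-attention needs a sequence axis to attend over); the layer sizes and the pooling step are assumptions.

    import tensorflow as tf

    inp = tf.keras.Input(shape=(20, 64))                      # 20 timesteps, 64 features (placeholders)
    x = tf.keras.layers.Dense(128, activation="relu")(inp)
    x = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)   # self-attention: q = k = v = x
    x = tf.keras.layers.GlobalAveragePooling1D()(x)           # collapse the sequence axis
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(10, activation="softmax")(x)  # 10 classes as a placeholder
    model = tf.keras.Model(inp, out)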
0
votes
0 answers

Creating attention mask for transformer block with (batch size, sequence length, spatial samples, embed dim) as input

I am trying to use a transformer to analyze some spatio-temporal data. I have an array of training data with dimensions "batch size x sequence length x spatial samples x embedding dimension." In order to prevent the transformer from cheating while…
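One hedged way to build such a mask, assuming attention runs along the time axis and the spatial axis is folded into the batch dimension first; all sizes are placeholders.

    import tensorflow as tf

    B, T, S, D = 8, 16, 10, 32                      # placeholder sizes
    x = tf.random.normal((B, T, S, D))

    # Fold the spatial axis into the batch axis so each spatial sample is its own sequence.
    x_seq = tf.reshape(tf.transpose(x, [0, 2, 1, 3]), (B * S, T, D))

    # Lower-triangular causal mask: timestep t may only attend to timesteps <= t.
    causal = tf.cast(tf.linalg.band_part(tf.ones((T, T)), -1, 0), tf.bool)

    mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D)
    out = mha(x_seq, x_seq, attention_mask=causal)  # the 2D mask broadcasts over batch and heads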
0
votes
0 answers

Positional Embedding in Transformers - Time Series Data

I'm adding Multi-Headed attention at the input of my CNN to improve the interpretability and explainability of my model. The data is a 3D time-series input of shape (125, 5, 6), where the 5x6 part represents the data in a single sample and 125…
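A hedged sketch of a learned positional embedding for this shape, assuming each of the 125 timesteps is flattened to a 30-dimensional (5 x 6) token and projected to a model dimension of 64; both choices are assumptions, not the asker's setup.

    import tensorflow as tf

    class PositionalEmbedding(tf.keras.layers.Layer):
        """Adds a learned position embedding to a (batch, length, dim) sequence."""
        def __init__(self, length, dim):
            super().__init__()
            self.pos_emb = tf.keras.layers.Embedding(length, dim)
            self.length = length

        def call(self, x):
            positions = tf.range(self.length)       # 0 .. length-1
            return x + self.pos_emb(positions)      # broadcasts over the batch axis

    inp = tf.keras.Input(shape=(125, 5, 6))
    tokens = tf.keras.layers.Dense(64)(tf.keras.layers.Reshape((125, 30))(inp))
    tokens = PositionalEmbedding(125, 64)(tokens)
    attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)(tokens, tokens)
    model = tf.keras.Model(inp, attn)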
0
votes
1 answer

TypeError: call() got an unexpected keyword argument 'use_causal_mask' when running on the flickr8k/flickr30k dataset

Error TypeError Traceback (most recent call last) /tmp/ipykernel_23/1382744270.py in 2 image_path = tf.keras.utils.get_file('surf.jpg', origin=image_url) 3 image = load_image(image_path) ----> 4…
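For context, use_causal_mask only exists in newer Keras releases (roughly TF 2.10 and later). A hedged workaround sketch for older versions is to pass an explicit causal attention_mask instead; the shapes below are placeholders.

    import tensorflow as tf

    def causal_mask(seq_len):
        # True on and below the diagonal: token t may attend to tokens <= t.
        return tf.cast(tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0), tf.bool)

    x = tf.random.normal((2, 12, 64))                # (batch, seq_len, features) placeholders
    mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
    out = mha(x, x, attention_mask=causal_mask(12))  # same effect as use_causal_mask=True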
0
votes
0 answers

What makes the multi-head self-attention matrices different?

Transformers (BERT) use one set of three matrices, Q, K, V for each attention head. BERT uses 12 attention heads in each layer, with each attention head having its own set of three such matrices. The actual values of these 36 matrices are obtained…
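A conceptual sketch of why the heads differ: each head has its own randomly initialised W_Q, W_K, W_V, and the different initialisations plus independent gradients during training push them apart. Sizes follow BERT-base; the loop-over-heads form is for clarity, not efficiency.

    import torch

    d_model, n_heads = 768, 12
    d_head = d_model // n_heads                     # 64 in BERT-base

    # 12 heads x 3 projections = 36 weight matrices, each initialised differently.
    heads = [
        {name: torch.nn.Linear(d_model, d_head, bias=False) for name in ("q", "k", "v")}
        for _ in range(n_heads)
    ]

    x = torch.randn(1, 10, d_model)                 # (batch, tokens, hidden)
    outputs = []
    for h in heads:
        q, k, v = h["q"](x), h["k"](x), h["v"](x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
        outputs.append(attn @ v)
    context = torch.cat(outputs, dim=-1)            # concatenated heads: back to (1, 10, 768)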
0
votes
0 answers

How to reveal relations between the number of words and the target with self-attention based models?

Transformers can handle variable-length input, but what if the number of words correlates with the target? Say we want to perform sentiment analysis on some reviews where longer reviews are more likely to be bad. How can the…
tusker • 57 • 5
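One hedged option for the question above: feed the (normalised) review length as an extra feature alongside the pooled transformer representation, so the classifier can exploit the correlation. All sizes and names here are placeholders, and this is only one possible design, not a recommendation.

    import tensorflow as tf

    text_repr = tf.keras.Input(shape=(256,))        # pooled output of some transformer (placeholder)
    length = tf.keras.Input(shape=(1,))             # e.g. token count divided by the maximum length
    x = tf.keras.layers.Concatenate()([text_repr, length])
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model([text_repr, length], out)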
0
votes
0 answers

Implementing Shaw's Relative Attention using Tensorflow

Is there a straightforward way to implement relative positional encoding as described in the Shaw paper using Tensorflow instead of absolute positional encoding? Thanks!
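A minimal single-head TensorFlow sketch of Shaw-style relative attention, under these assumptions: one head only, relative positions clipped to +/- max_rel, and only the a^K (key) term from the paper, with no a^V term.

    import tensorflow as tf

    class RelativeSelfAttention(tf.keras.layers.Layer):
        def __init__(self, d_model, max_rel=8):
            super().__init__()
            self.wq = tf.keras.layers.Dense(d_model)
            self.wk = tf.keras.layers.Dense(d_model)
            self.wv = tf.keras.layers.Dense(d_model)
            # One embedding per clipped relative distance in [-max_rel, max_rel].
            self.rel_emb = tf.keras.layers.Embedding(2 * max_rel + 1, d_model)
            self.max_rel = max_rel
            self.scale = float(d_model) ** 0.5

        def call(self, x):                          # x: (batch, T, d_model)
            T = tf.shape(x)[1]
            q, k, v = self.wq(x), self.wk(x), self.wv(x)
            rel = tf.range(T)[None, :] - tf.range(T)[:, None]
            rel = tf.clip_by_value(rel, -self.max_rel, self.max_rel) + self.max_rel
            a_k = self.rel_emb(rel)                           # (T, T, d_model)
            logits = tf.matmul(q, k, transpose_b=True)        # content term
            logits += tf.einsum("btd,tsd->bts", q, a_k)       # relative position term
            weights = tf.nn.softmax(logits / self.scale, axis=-1)
            return tf.matmul(weights, v)

    out = RelativeSelfAttention(d_model=64)(tf.random.normal((2, 20, 64)))   # (2, 20, 64)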
0
votes
0 answers

Why can the key dimension be different than the input sequence length for self attention on a time series?

In "Timeseries classification with a Transformer model", the author builds a transformer encoder like this: def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0): x = layers.LayerNormalization(epsilon=1e-6)(inputs) x =…
aez • 2,406 • 2 • 26 • 46
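A quick shape check illustrating the point behind the question: key_dim is the size tokens are projected to inside each head, not the sequence length, and the attention score matrix is always seq_len x seq_len regardless. Sizes are placeholders.

    import tensorflow as tf

    x = tf.random.normal((1, 100, 7))     # 100 timesteps, 7 features (placeholders)
    mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=256)
    out, scores = mha(x, x, return_attention_scores=True)
    print(out.shape)      # (1, 100, 7): projected back to the input feature size
    print(scores.shape)   # (1, 4, 100, 100): heads x seq_len x seq_len, independent of key_dim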
0
votes
0 answers

Simple self-Attention API to learn from vector sequence

I wanted to implement simple softmax-based self-attention for a sequence of vectors. Using PyTorch's multi-head self-attention API seems overwhelming for my task with a large number of parameters to train. Is there any API/ simple codebase to do…
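A minimal single-head PyTorch sketch with far fewer parameters than nn.MultiheadAttention: one linear map each for query, key and value, followed by scaled dot-product softmax attention.

    import torch
    import torch.nn as nn

    class SimpleSelfAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim, bias=False)
            self.k = nn.Linear(dim, dim, bias=False)
            self.v = nn.Linear(dim, dim, bias=False)
            self.scale = dim ** 0.5

        def forward(self, x):                       # x: (batch, seq_len, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            weights = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
            return weights @ v                      # (batch, seq_len, dim)

    out = SimpleSelfAttention(dim=32)(torch.randn(4, 10, 32))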
0
votes
0 answers

How to check a transformer model's attention?

I am doing a project on text summarization, and I am also visualizing the attention mask during training. The model is working well, but I want to check and show which words the model attends to when predicting the short summary from the long text.
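A hedged sketch assuming a Hugging Face seq2seq model (t5-small is used here only as an example checkpoint): passing output_attentions=True returns the per-layer, per-head attention weights, and the cross-attentions show which source words each summary token attended to.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tok = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    inputs = tok("summarize: The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    labels = tok("A fox jumps over a dog.", return_tensors="pt").input_ids

    out = model(**inputs, labels=labels, output_attentions=True)
    # One tensor per decoder layer, shape (batch, heads, target_len, source_len):
    # which input tokens each summary token attended to.
    print(out.cross_attentions[0].shape)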
0
votes
1 answer

How to generate vision transformer attention maps for 3D grayscale MRI data

How can I generate attention maps for 3D grayscale MRI data after training with vision transformer for a classification problem? My data shape is (120,120,120) and the model is 3D ViT. For example: img = nib.load() img = torch.from_numpy(img) model…
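A hedged sketch of turning ViT attention weights into a volume, assuming a patch size of 8 (so a 15 x 15 x 15 patch grid for a 120^3 scan) and a CLS token in the first position; the random tensor stands in for the real last-layer attention weights of the asker's 3D ViT.

    import torch
    import torch.nn.functional as F

    patch_grid = (15, 15, 15)                              # 120 / 8 per axis (assumed patch size 8)
    n_patches = 15 * 15 * 15
    attn = torch.rand(1, 8, 1 + n_patches, 1 + n_patches)  # stand-in for the real attention weights
    attn = attn / attn.sum(-1, keepdim=True)

    cls_attn = attn[0].mean(0)[0, 1:]                      # average heads, CLS token -> patches
    volume = cls_attn.reshape(1, 1, *patch_grid)           # back onto the 3D patch grid
    heatmap = F.interpolate(volume, size=(120, 120, 120),
                            mode="trilinear", align_corners=False)   # overlay on the original MRI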
0
votes
0 answers

sparse attention and its relation with attention mask

Can anyone please explain clearly what the mask is used for in sparse attention? I just cannot see how masking tokens (I do not mean pad tokens here) can make attention faster, for example as described in sparse attention…
Arij Aladel • 356 • 1 • 3 • 10
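A small sketch of a sliding-window (local) sparsity pattern, one of the masks used in sparse attention. The dense boolean mask itself does not speed anything up; the speed-up comes from implementations that only compute the score entries inside the band instead of the full T x T matrix. The window radius w is an assumption.

    import torch

    T, w = 12, 2
    idx = torch.arange(T)
    local_mask = (idx[None, :] - idx[:, None]).abs() <= w   # (T, T) boolean band
    print(local_mask.int())
    # Dense attention does O(T^2) work; with the band pattern only
    # O(T * (2w + 1)) score entries actually need to be computed.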
0
votes
0 answers

How to get inter-stock relationships using deep learning?

I'm trying to get the relationship between stock companies based on their historical closing prices. Cross-correlation or other similarity matrices can perform this task, but I want to use deep learning methods (RNN/attention) to extract the relationship…
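A hedged sketch of one possible design: encode each stock's price history with a shared GRU, let the stock embeddings attend to each other, and read the attention weights as a learned stock-to-stock relationship matrix. All sizes, and the GRU/attention choice itself, are assumptions.

    import torch
    import torch.nn as nn

    n_stocks, n_days = 20, 250
    prices = torch.randn(n_stocks, n_days, 1)               # one closing-price series per stock

    gru = nn.GRU(input_size=1, hidden_size=32, batch_first=True)
    _, h = gru(prices)                                       # shared encoder over all stocks
    stock_emb = h[-1].unsqueeze(0)                           # (1, n_stocks, 32)

    attn = nn.MultiheadAttention(embed_dim=32, num_heads=1, batch_first=True)
    _, relation = attn(stock_emb, stock_emb, stock_emb)      # relation: (1, n_stocks, n_stocks)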
0
votes
0 answers

Confusion regarding num_heads & key_dim in keras.layers.MultiHeadAttention in the transformer tutorial

In the tf.keras tutorial: https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/transformer.ipynb, class EncoderLayer(tf.keras.layers.Layer): def __init__(self,*, d_model, # Input/output…
kawingkelvin • 3,649 • 2 • 30 • 50
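A shape sketch of the point at issue: in keras.layers.MultiHeadAttention, each head projects tokens down to key_dim dimensions, and a final dense layer maps num_heads * key_dim back to the query's feature size, so d_model does not have to equal num_heads * key_dim.

    import tensorflow as tf

    d_model, num_heads, key_dim = 512, 8, 64
    mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)

    x = tf.random.normal((1, 10, d_model))
    print(mha(x, x).shape)          # (1, 10, 512): projected back to d_model
    for w in mha.weights:
        print(w.name, w.shape)      # query/key/value kernels: (512, 8, 64); output kernel: (8, 64, 512)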