
I'm using Transformers to process time-series data. Each X-second time window of data (from S sensors) is embedded into F features before being input to the Transformer. Each span of F/S features in the embedding corresponds to one sensor's data. The training objective is very similar to masked language modeling in NLP: during training, 25% of the embeddings in each sequence are masked by replacing them with a learned MASK embedding. The outputs at the masked positions (e.g. positions 1 and 4 in the sequence) are compared with the outputs at the same positions generated from the same input sequence when no masking is applied.
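To make the setup concrete, here is a minimal NumPy sketch of the masking step described above. All sizes, names, and the identity `transformer` placeholder are hypothetical; the real model and comparison loss would replace them.

```python
import numpy as np

rng = np.random.default_rng(0)

T, F = 8, 20           # sequence length and embedding dim (hypothetical sizes)
MASK_RATIO = 0.25      # fraction of positions to mask, as in the question

x = rng.normal(size=(T, F))        # embedded input sequence
mask_emb = rng.normal(size=(F,))   # learned MASK embedding (randomly initialized here)

# Choose 25% of the positions to mask.
n_masked = max(1, int(MASK_RATIO * T))
masked_idx = rng.choice(T, size=n_masked, replace=False)

x_masked = x.copy()
x_masked[masked_idx] = mask_emb    # replace whole embeddings with the MASK embedding

def transformer(seq):
    # Placeholder standing in for the actual Transformer; identity for shape only.
    return seq

# Compare outputs at the masked positions with outputs from the unmasked pass.
targets = transformer(x)[masked_idx]
preds = transformer(x_masked)[masked_idx]
loss = np.mean((preds - targets) ** 2)   # e.g. an MSE-style comparison
```

The sketch only illustrates the bookkeeping: which positions are replaced, and which positions from the two passes are compared.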

The idea behind the masked modeling objective is to teach the Transformer to draw on surrounding context to produce similar embeddings even when the current embedding is masked. For my purposes, X cannot be decreased. However, the time window is on the longer side, so too much information may be lost when an entire embedding is masked. I am now considering a masking approach where only part of each 'masked' embedding is actually masked; by 'part' I mean all of the data from a random subset of the sensors.

Analogy to a CV task: consider training a model to embed images of animals, with each image embedded as a sequence of smaller image patches. You use a masked modeling objective so the model learns to draw on information from elsewhere in the image to embed masked patches. The image patches are large enough that you risk masking out most of the head and body of the animal, depending on which patches are randomly chosen. Since data constraints prohibit you from simply reducing the patch size, you choose to mask only part of each patch at random. Now the model has a little more information about the masked patches and learns to denoise using this limited information as well as the surrounding context.

What I'm wondering:

  1. Is this a feasible approach, and has it been tried? With this kind of objective, the Transformer usually draws on surrounding context rather than on the current embedding itself. I used the nlp and language-model tags since the training objective originated in the NLP domain.

  2. Could it make sense to continue using a learned MASK embedding in this case (since sensor-specific spans are dropped at random, the set of masked feature indices will vary), or would it be necessary to set masked indices to zero? With a learned MASK embedding: since each section of my input features is sensor-specific (e.g. with 20 features and 4 sensors, the first 5 features come from the first sensor), I am considering learning an (F/S)-length MASK embedding and replacing each masked sensor's span with this shorter embedding (e.g. if masking sensors 1 and 4, replace the first 5 and last 5 features with the MASK embedding).
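The per-sensor variant in question 2 can be sketched the same way. This is a minimal NumPy illustration of replacing the feature spans of a chosen subset of sensors with a shared learned (F/S)-length MASK embedding; the sizes and the particular masked sensors are hypothetical, and the MASK vector is randomly initialized here rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

S, F = 4, 20            # sensors and total feature dim, so F // S = 5 features per sensor
span = F // S

x = rng.normal(size=(F,))                   # one time-step embedding
sensor_mask_emb = rng.normal(size=(span,))  # learned (F/S)-length MASK embedding

# Mask a random subset of sensors; here sensors 0 and 3 (0-based) for illustration.
masked_sensors = [0, 3]

x_masked = x.copy()
for s in masked_sensors:
    # Replace only this sensor's feature span; the other sensors stay intact.
    x_masked[s * span:(s + 1) * span] = sensor_mask_emb
```

One design question this raises: because every masked sensor span receives the same (F/S)-length vector, the model can tell *which* sensors were masked but gets no sensor identity from the MASK content itself; a per-sensor MASK embedding (an (S, F/S) table) would be the obvious alternative if that matters.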

