
I am using a BERT model to generate text embeddings. My strings look like `There is pneumonia detected in the left corner`. When I encode a batch of 20 strings and print the model output, it returns `[20 256]`, where 20 is the batch size and 256 is the size of each output vector. In other words, it encodes each text as a single vector of size 256 (`[1 256]`).

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras import layers

def create_text_encoder(
        num_projection_layers, projection_dims, dropout_rate, trainable=False):

    # Load the BERT preprocessing module.
    preprocess = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2",
        name="text_preprocessing",
    )

    # Load the pre-trained BERT model to be used as the base encoder.
    bert = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1",
        name="bert",
    )

    # Set the trainability of the base encoder.
    bert.trainable = trainable

    # Receive the text as inputs.
    inputs = layers.Input(shape=(), dtype=tf.string, name="text_input")

    # Preprocess the text.
    bert_inputs = preprocess(inputs)

    # Generate embeddings for the preprocessed text using the BERT model.
    embeddings = bert(bert_inputs)["pooled_output"]

    # Project the embeddings produced by the model
    # (project_embeddings is defined elsewhere in my code).
    outputs = project_embeddings(
        embeddings, num_projection_layers, projection_dims, dropout_rate)

    # Create the text encoder model.
    return keras.Model(inputs, outputs, name="text_encoder")
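For reference, this is how I call it on a batch (a minimal sketch; the argument values are just the ones I happen to use):

text_encoder = create_text_encoder(
    num_projection_layers=1, projection_dims=256, dropout_rate=0.1)

# A batch of 20 strings produces a [20, 256] tensor of embeddings.
batch = tf.constant(
    ["There is pneumonia detected in the left corner"] * 20)
embeddings = text_encoder(batch)
print(embeddings.shape)  # (20, 256)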

Now, after feeding a single string such as `There is pneumonia detected in the left corner` to the model above, I want to divide it into 5 patches. Previously the model generated an embedding of size `[1 256]` for a single string; now it should generate `[5 256]` for a single text: five vectors per text, each of size 256.
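To make this concrete, here is a minimal sketch of the kind of splitting I mean (the rule for forming the 5 sub-strings is arbitrary, and `split_into_patches` is just a hypothetical helper, not part of my model):

import numpy as np

def split_into_patches(text, num_patches=5):
    # Naively split the words into num_patches roughly equal groups.
    words = text.split()
    return [" ".join(group) for group in np.array_split(words, num_patches)]

patches = split_into_patches("There is pneumonia detected in the left corner")
# ['There is', 'pneumonia detected', 'in the', 'left', 'corner']

# Encode the 5 patches as 5 independent strings.
patch_embeddings = text_encoder(tf.constant(patches))
print(patch_embeddings.shape)  # (5, 256)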

Is it possible? Has anyone done this before?

  • Can you expand on what "divide into 5 patches" means? – Marat Jul 29 '22 at 16:48
  • I have edited my question above to make it clearer. – Jacob Jul 29 '22 at 16:56
  • It is still not clear, and the introduction of 'image' actually makes it even more confusing. – Marat Jul 29 '22 at 17:04
  • Sorry, I have fixed the mistake. My input is this string `There is pneumonia detected in the left corner` and the output is a vector of shape `[1 256]`. Can I generate parts of this string as output, say 5 parts, so that the input stays the same but the output shape is `[5 256]` instead of `[1 256]`? – Jacob Jul 29 '22 at 17:18
  • What are the criteria for this? E.g. you could simply tile the embeddings to get the right shape, but you probably want something more meaningful. What exactly is it that you're trying to get here? – Marat Jul 29 '22 at 18:19
  • Basically my idea is to build a CLIP model where I use a vision encoder and a text encoder. I have split the images into 5 patches, but I don't know how to do it for text. I want to split `There is pneumonia detected in the left corner` into separate word groups [`There is`, `pneumonia`, `detected in`, `the`, `left corner`] and then generate 5 vectors as if they were 5 strings. – Jacob Jul 29 '22 at 19:44
  • I don't know if making patches of a string this way would divide my string into separate parts like [`There is`, `pneumonia`, `detected in`, `the`, `left corner`]. I don't care if the grouping comes out differently, like [`There`, `is pneumonia`, `detected`, `in the left`, `corner`], but the model should generate a matrix `[5 256]` and not a single vector `[1 256]`. – Jacob Jul 29 '22 at 19:48
  • I see. I don't think you'll get an answer here, for two reasons. First, it is not a programming question - it'd be more appropriate to ask [here](https://datascience.stackexchange.com/) ([meta post](https://meta.stackexchange.com/questions/130524/)). Second, you don't seem to have a developed idea to pursue or sufficient background to make it. Having a technically correct answer won't help in this case – Marat Jul 29 '22 at 20:49
  • If there is a way to divide an image into patches, I thought there would be one in natural language models too. This is a Keras built-in model, and there are very good keras/tensorflow developers here. – Jacob Jul 29 '22 at 21:05
  • TLDR: given the context, the question doesn't make sense. Transformers operate over a sequence of tokens. For text data, we split it into wordpieces (plus learned embeddings on top). For images, there is no natural split order, so multimodal models use image patches + positional encoding instead. Exact alignment of image and text "words" is not important here. Out of the box, the contextual embedding for every wordpiece is equivalent to any of the image patches. Forcing a text sentence to have exactly five patches simply doesn't make sense. – Marat Jul 29 '22 at 21:39
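A minimal sketch of the per-wordpiece embeddings Marat describes, assuming the same `preprocess` and `bert` layers as in the question (built at top level for this snippet); the `sequence_output` key of the TF Hub BERT models returns one contextual vector per token position instead of the single pooled vector:

import tensorflow as tf
import tensorflow_hub as hub

preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2")
bert = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1")

# One contextual 512-d vector per wordpiece position, instead of the
# single pooled vector per string.
bert_inputs = preprocess(tf.constant(
    ["There is pneumonia detected in the left corner"]))
token_embeddings = bert(bert_inputs)["sequence_output"]
print(token_embeddings.shape)  # (1, 128, 512): 128 is the default sequence length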

0 Answers