
I'm trying to implement a question answering system that deals with long input texts. The idea is to split the long input into subsequences of 510 tokens, generate a representation of each subsequence independently, and then use a pooling layer to build the final representation of the whole input.

I'm using the CamemBERT model for French.

I have tried the following code:

    import torch
    import torch.nn as nn
    from transformers import CamembertForQuestionAnswering, CamembertTokenizer


    class CamemBERTQA(nn.Module):

        # Initialize the model, its tokenizer, and the pooling layer
        def __init__(self, do_lower_case: bool = True):
            super(CamemBERTQA, self).__init__()
            self.config_keys = ['do_lower_case']
            self.do_lower_case = do_lower_case
            self.camembert = CamembertForQuestionAnswering.from_pretrained('fmikaelian/camembert-base-fquad')
            self.tokenizer = CamembertTokenizer.from_pretrained('fmikaelian/camembert-base-fquad', do_lower_case=do_lower_case)
            self.cls_token_id = self.tokenizer.convert_tokens_to_ids([self.tokenizer.cls_token])[0]
            self.sep_token_id = self.tokenizer.convert_tokens_to_ids([self.tokenizer.sep_token])[0]
            self.pool = nn.MaxPool2d(2, 2)


        # Split a long input text into overlapping subsequences
        def split_text(self, text, max_length, overlap):  # 511 tokens max
            subsequences = []
            words = text.split()
            for i in range(0, len(words) - overlap, max_length - overlap):
                subsequences.append(" ".join(words[i:i + max_length]))
            return subsequences

        # Generate the representation of each subsequence in a list
        def text_representation(self, subsequences):
            result = []
            for subsequence in subsequences:
                input_ids = torch.tensor([self.tokenizer.encode(subsequence, add_special_tokens=True)])
                with torch.no_grad():
                    # Use the underlying encoder to get the last hidden states
                    # (model outputs are tuples)
                    last_hidden_states = self.camembert.roberta(input_ids)[0]
                    result.append(last_hidden_states)
            return result


        def forward(self, text, input_ids):
            # Split the input text into subsequences of 511 tokens with an overlap of 10
            subsequences = self.split_text(text, 511, 10)

            # Generate the representation of each subsequence
            representations = self.text_representation(subsequences)

            # Pooling layer
            # pool = self.pool(...)

            ###########  The problem is here: how can I add a pooling layer?  #################

            # input_ids = ...  # final output of the pooling layer; the result should contain 510 elements/tokens

            # Generate the start and end scores of the answer
            start_scores, end_scores = self.camembert(torch.tensor([input_ids]))
            start_index = torch.argmax(start_scores)
            end_index = torch.argmax(end_scores) + 1
            outputs = (start_index, end_index,)

            return outputs

Since I'm a beginner with PyTorch, I'm not sure whether the code should look like this.

If you have any advice, or if you need any further information, please let me know.

John Smith

1 Answer


I'm pretty new to all of this myself, but maybe this could help you:

    def max_pooling(input_tensor, max_sequence_length):
        # Pool over the sequence-length dimension, keeping the hidden dimension
        mxp = nn.MaxPool2d((max_sequence_length, 1), stride=1)
        return mxp(input_tensor)
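
To connect this with the rest of your question, here is a minimal, self-contained sketch of how that function could be applied to the list returned by text_representation. The shapes here are assumptions (camembert-base has a hidden size of 768), and the random tensors merely stand in for real subsequence representations; for the token-level variant the subsequences would first need to be padded to the same length:

    import torch
    import torch.nn as nn

    def max_pooling(input_tensor, max_sequence_length):
        mxp = nn.MaxPool2d((max_sequence_length, 1), stride=1)
        return mxp(input_tensor)

    hidden_size = 768  # assumed hidden size of camembert-base

    # Stand-ins for the outputs of text_representation():
    # three subsequence representations of shape [1, seq_len, hidden_size]
    reps = [torch.randn(1, 510, hidden_size) for _ in range(3)]

    # Option 1: collapse each subsequence to a single vector, then
    # concatenate them -> [1, num_subsequences, hidden_size]
    pooled = [max_pooling(r, r.shape[1]) for r in reps]
    doc_rep = torch.cat(pooled, dim=1)
    print(doc_rep.shape)    # torch.Size([1, 3, 768])

    # Option 2: element-wise max across the subsequences, keeping the
    # 510 token positions -> [1, 510, hidden_size]
    stacked = torch.stack(reps, dim=0)            # [3, 1, 510, 768]
    token_rep = torch.max(stacked, dim=0).values  # [1, 510, 768]
    print(token_rep.shape)  # torch.Size([1, 510, 768])

Option 2 matches the 510-token output you describe in the question, at the cost of requiring all subsequence representations to be aligned and padded to the same length first.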
Anastasia