Does the value of max_len in pad sequences for deep learning depend on the use case? For example, for a Twitter-related classification task, should it be set to 280 (the maximum number of characters in a tweet)?

KK47

1 Answer

Absolutely not. After converting the texts into sequences with a tokenizer fitted on the list of tweets, you can iterate over those sequences to find their lengths.

The max_len parameter of the pad_sequences function refers to the maximum length of a sequence, so it does not mean the length of a tweet in characters; it means the length of the token sequence.

Beyond that, you don't need to set it to the maximum length of the tweet sequences; you can even set it lower. Note that with this approach it is better to remove stopwords and filter characters before fitting the tokenizer on the list of tweets.
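To make this concrete, here is a minimal sketch in plain Python. The sample tweets, the tiny word-to-id mapping (a stand-in for a fitted tokenizer), the `pad_or_truncate` helper and the choice of 5 as max_len are all assumptions made up for illustration; padding at the front and keeping the last tokens is just one possible convention, not something the library forces on you.

```python
# Hypothetical sample data; a real pipeline would use your own tweets.
tweets = [
    "great game tonight",
    "i really do not care about this at all",
    "ok",
    "the new phone is out and it looks amazing",
]

# Stand-in for a fitted tokenizer: give each new word the next integer id.
# Id 0 is reserved for padding.
vocab = {}
sequences = []
for tweet in tweets:
    seq = []
    for word in tweet.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1
        seq.append(vocab[word])
    sequences.append(seq)

# Sequence length is counted in tokens, not characters.
lengths = [len(seq) for seq in sequences]
longest = max(lengths)


def pad_or_truncate(seq, max_len, pad_id=0):
    """Left-pad short sequences with pad_id; keep the last max_len tokens of long ones."""
    if len(seq) >= max_len:
        return seq[-max_len:]
    return [pad_id] * (max_len - len(seq)) + seq


# max_len may be lower than the longest sequence; 5 is an arbitrary choice here.
max_len = 5
padded = [pad_or_truncate(seq, max_len) for seq in sequences]
```

Here the longest tweet has 9 tokens even though it is far shorter than 280 characters, and setting max_len to 5 simply cuts the two long tweets down while padding the short ones.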

Soroush Mirzaei
  • So is it better to find the maximum length of the sequence or is it fine to just put an arbitrary value for max_len? – KK47 Aug 21 '22 at 09:01
  • It all depends on the case and the model. As a rough idea: a tweet can have at most 280 characters, so if we estimate each word at about 5 characters, you could set the max length to 56 without removing stopwords. You could also compute the lengths of the sequences and set the max length to their average. But let me add something more for this case. – Soroush Mirzaei Aug 21 '22 at 14:18
  • Because words in tweets are often misspelled, it would be better to remove stopwords first, and you can do it even better with one extra trick: substitute words that mean the same thing but are spelled differently. For example, 'i dont care' and its abbreviation 'idc' mean the same thing, and elongated words such as 'welll' mean the same as 'well'. So replace such variants with a single form, then remove stopwords, then convert the texts into sequences, and only then decide on the max length using the rules I gave above. – Soroush Mirzaei Aug 21 '22 at 14:28
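The cleanup and the two max length rules from the comments above could look roughly like this. It is only a sketch: the `ABBREVIATIONS` and `STOPWORDS` tables, the `collapse_repeats` and `normalize` helpers and the sample tweets are hypothetical and far too small for real use, and the repeated-letter rule (squeeze three or more identical characters down to two) is a crude assumption that fixes 'welll' but will get other words wrong.

```python
import re

# Hypothetical lookup tables; real ones would be much larger.
ABBREVIATIONS = {"idc": "i dont care"}
STOPWORDS = {"i", "the", "a", "is", "it"}


def collapse_repeats(word):
    """Squeeze runs of 3+ identical characters down to 2, e.g. 'welll' becomes 'well'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


def normalize(tweet):
    """Unify spelling variants first, then drop stopwords."""
    words = []
    for raw in tweet.lower().split():
        word = collapse_repeats(raw)
        # An abbreviation may expand into several words.
        words.extend(ABBREVIATIONS.get(word, word).split())
    return [w for w in words if w not in STOPWORDS]


tweets = [
    "idc it is welll done",
    "i dont care it is well done",
    "the game is a draw",
]
cleaned = [normalize(t) for t in tweets]

# Rule 1: 280 characters at roughly 5 characters per word.
char_based_max_len = 280 // 5

# Rule 2: average length of the cleaned sequences, rounded to an integer.
average_max_len = round(sum(len(c) for c in cleaned) / len(cleaned))
```

After normalization the first two tweets turn into the same list of words, which is the point of the substitution step. Note how far apart the two rules land on this toy data (56 versus 3), so it is worth checking the actual length distribution before picking one.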