
IBM Watson Natural Language Classifier (NLC) limits the text values in the training set to 1024 characters: https://console.bluemix.net/docs/services/natural-language-classifier/using-your-data.html#training-limits .

However, the trained model can then classify any text of up to 2048 characters: https://console.bluemix.net/apidocs/natural-language-classifier#classify-a-phrase .

This difference confuses me: I have always understood that we should apply the same pre-processing in both the training phase and the production phase, so if I had to cap the training data at 1024 characters, I would do the same in production.

Is my reasoning correct? Should I cap the text in production at 1024 characters (as I believe I should), or at 2048 characters (perhaps because 1024 characters are too few)?

Thank you in advance!

Rosa

1 Answer


Recently, I had the same question, and an answer in an article clarified it:

Currently, the limits are set at 1024 characters for training and 2048 for testing/classification. The 1024 limit may require some curation of the training data prior to training. Most organizations that require larger character limits for their data end up chunking their input text into 1024-character chunks. Additionally, in use cases with data similar to the Airbnb reviews, the primary category can typically be assessed within the first 2048 characters, since there is often a lot of noise in lengthy reviews.
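The chunking workaround mentioned above could be sketched roughly like this (my own illustration, not code from the article; `chunk_text` is a hypothetical helper, and the 1024 limit matches the training cap):

```python
def chunk_text(text, limit=1024):
    """Split text into pieces no longer than `limit` characters,
    preferring to break at whitespace so words stay intact."""
    chunks = []
    while len(text) > limit:
        # Split at the last space inside the limit, if any.
        cut = text.rfind(" ", 0, limit)
        if cut <= 0:
            cut = limit  # no space found: hard-cut at the limit
        chunks.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        chunks.append(text)
    return chunks
```

Each resulting chunk fits within the training limit and can be used as a separate training example (or classified separately in production).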

Here's the link to the article

Vidyasagar Machupalli
  • @Rosa Post your questions here. I will try to answer to the best of my knowledge – Vidyasagar Machupalli Nov 28 '18 at 10:46
  • Thank you! This link shows 2 scenarios, but neither clarifies my doubts. If 2048 chars are enough to retrieve the relevant information but 1024 chars are not, I will end up with a classifier that is trained on irrelevant text, so using 2048 chars during testing would return bad performance because of the 1024-char limitation during training. If 2048/1024 chars are not enough, it is true that I could chunk the original input text into smaller groups, but it is not trivial to merge all the resulting classes into one final class. What do you think about that? – Rosa Nov 28 '18 at 10:57
  • Reading through the best-practices presentation, the last slide talks about decomposing a large dataset and then using NLC on top of it. Here's the link to the presentation: https://www.ibm.com/watson/assets-watson/pdf/Watson-NLC-Links-Best-Practices-Design-Patterns.pdf – Vidyasagar Machupalli Nov 28 '18 at 12:06
  • Hi, thank you, now the case in which 1024 chars are not enough is clear: we need to apply some business rules or meta-classifiers that can summarize the sentence-level results into one single final result. However, in the case where 1024 chars are already enough to train a good classifier, I am still wondering whether I should cut at 1024 or 2048 in production: since I train a model on texts of at most 1024 chars, shouldn't I also cut at 1024 in production? Or is it better to give Watson as much info as possible and cut at 2048? – Rosa Nov 28 '18 at 15:48
  • I would say stick with 2048 for testing/classification (production) – Vidyasagar Machupalli Nov 29 '18 at 00:19
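For the chunk-and-aggregate approach discussed in the comments, one simple way to merge per-chunk classes into a single final class is to sum the classifier's confidence scores per class across chunks and pick the winner. This is my own sketch of one possible meta-classifier, not something the NLC docs prescribe; `merge_chunk_results` and the input shape are assumptions:

```python
from collections import defaultdict

def merge_chunk_results(chunk_results):
    """Combine per-chunk classification results into one final class.

    `chunk_results` is a list with one entry per chunk; each entry is a
    list of (class_name, confidence) pairs returned for that chunk.
    Confidences are summed per class, and the class with the highest
    total is returned.
    """
    totals = defaultdict(float)
    for classes in chunk_results:
        for name, confidence in classes:
            totals[name] += confidence
    return max(totals, key=totals.get)
```

A majority vote over each chunk's top class would be an equally reasonable rule; summing confidences just lets strongly-classified chunks outweigh ambiguous ones.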