7

What is the difference between tokenization and segmentation in NLP? I searched for both terms but couldn't find a clear distinction between them.

hippietrail
Mahmoud Noor

1 Answer

3

Short answer: All tokenization is segmentation, but not all segmentation is tokenization.

Long answer:
Segmentation is the more generic concept of splitting the input text into pieces; tokenization is one type of segmentation, carried out according to well-defined criteria.
For example, if every one of your input sentences were a compound sentence made of two sub-sentences, splitting each into two independent sentences would be segmentation (but not tokenization).
Tokenization is a form of segmentation performed on the basis of semantic criteria or a token dictionary (e.g. word or sub-word tokenization), mainly with the intention of assigning token ids to the resulting pieces for downstream processing.
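
To make the distinction concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the regex sentence splitter is naive, the vocabulary `VOCAB` is a hypothetical toy dictionary, and the greedy longest-match sub-word split is a WordPiece-style stand-in, not the tokenizer of any particular library or model.

```python
import re

def segment_sentences(text):
    """Segmentation: split text into sentences after '.', '!' or '?'.
    Naive rule for illustration; it will mis-split abbreviations."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"[UNK]": 0, "token": 1, "##ization": 2, "is": 3,
         "segment": 4, "##ation": 5, ".": 6}

def tokenize_word(word):
    """Greedy longest-match-first sub-word split against VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: give up on the word.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def tokenize(sentence):
    """Tokenization: split into dictionary tokens and map them to ids."""
    words = re.findall(r"\w+|[^\w\s]", sentence.lower())
    tokens = [p for w in words for p in tokenize_word(w)]
    return tokens, [VOCAB[t] for t in tokens]

text = "Tokenization is segmentation. Segmentation is tokenization."
sentences = segment_sentences(text)       # segmentation only
tokens, ids = tokenize(sentences[0])      # tokenization (also segmentation)
print(sentences)
print(tokens, ids)
```

The first step only cuts the text into sentences, with no dictionary and no ids involved. The second step cuts a sentence into pieces drawn from a fixed vocabulary and assigns each piece an id, which is what makes it tokenization.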

jdsurya
  • Could you please give me a real-world example for more clarification? – Mahmoud Noor Nov 20 '21 at 20:15
  • Breaking your text corpus into sentences is segmentation, but not tokenization. Using sub-words of a sentence to generate token ids as input to a transformer model is tokenization (hence also segmentation). – jdsurya Nov 20 '21 at 22:53