I would like to build a text corpus for an NLP project in Python. I've seen this text format in the LSHTC4 Kaggle challenge:
5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1
The first number corresponds to the label.
Each pair of numbers separated by ':' corresponds to a (feature, value) pair of the vector, where the first number is the feature's id and the second number is its frequency (for example, the feature with id 18 appears 2 times in the instance).
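To check that I am reading the format correctly, here is a minimal sketch of how I would parse one such line in Python. The function name `parse_line` is my own, and I am assuming a single integer label at the start of the line, as in the sample above.

```python
def parse_line(line):
    """Parse one line into (label, {feature_id: frequency}).

    Assumes the first token is a single integer label and every
    remaining token has the form "feature_id:frequency".
    """
    label_token, *pair_tokens = line.split()
    features = {}
    for token in pair_tokens:
        feature_id, frequency = token.split(":")
        features[int(feature_id)] = int(frequency)
    return int(label_token), features


label, features = parse_line("5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1")
print(label)         # 5
print(features[18])  # 2
```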
I don't know whether this is a common way to convert text data into numeric vectors. I can't find the pre-processing procedure in the challenge; the data were provided already pre-processed.