Text preprocessing in Python

Question

I would like to build a text corpus for an NLP project in Python. I've seen this text format in the LSHTC4 Kaggle challenge:

5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1

The first number corresponds to the label.

Each pair of numbers separated by ':' corresponds to a (feature, value) pair of the vector, where the first number is the feature's id and the second number is its frequency (for example, the feature with id 18 appears 2 times in the instance).

I don't know whether this is a common way to preprocess text data into a numeric vector. I can't find the preprocessing procedure in the challenge; the data were already preprocessed.

So what does the starting `5` mean? – Avinash Raj, Jul 17 '15 at 14:49
It's the category your document belongs to, i.e. its label. – Bérengère, Jul 17 '15 at 14:50

score 0 · Answer 1 · answered Jul 17 '15 at 14:57

No package necessary in R (nor in Python if I'm not mistaken). First get everything split up (and remove that initial 5). I'm guessing you want the result as numbers, not strings:

x <- "5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1"
# split on spaces and colons, then drop the first element (the label)
y <- as.integer(unlist(strsplit(x, split=" |:"))[-1])
feature <- y[seq(1, length(y), by=2)]
# [1]     0     8    18    54   442  3784  5640 43501
value <- y[seq(2, length(y), by=2)]
# [1] 10  1  2  1  2  1  1  1
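Since the question asks about Python, here is a rough equivalent of the split above using only the standard library. It is a minimal sketch, not a reference parser: `parse_line` is a name made up for this example, and it assumes one integer label followed by space-separated `id:count` pairs, exactly as in the sample line.

```python
def parse_line(line):
    """Split 'label id:count id:count ...' into (label, feature ids, values)."""
    # The first whitespace-separated token is the label; the rest are pairs.
    label, *pairs = line.split()
    feature_ids, values = [], []
    for pair in pairs:
        feature_id, value = pair.split(":")
        feature_ids.append(int(feature_id))
        values.append(int(value))
    return int(label), feature_ids, values


label, feature, value = parse_line("5 0:10 8:1 18:2 54:1 442:2 3784:1 5640:1 43501:1")
print(label)    # 5
print(feature)  # [0, 8, 18, 54, 442, 3784, 5640, 43501]
print(value)    # [10, 1, 2, 1, 2, 1, 1, 1]
```

Unlike the R version, this keeps the label instead of discarding it; `list(zip(feature, value))` gives the pairs side by side.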

If you want them side-by-side:

cbind(feature,value)
     feature value
[1,]       0    10
[2,]       8     1
[3,]      18     2
[4,]      54     1
[5,]     442     2
[6,]    3784     1
[7,]    5640     1
[8,]   43501     1

If you want to assign them to a data.table for analysis:

library(data.table)
dt <- data.table(feature=feature, value=value)

> dt
   feature value
1:       0    10
2:       8     1
3:      18     2
4:      54     1
5:     442     2
6:    3784     1
7:    5640     1
8:   43501     1

Etc.
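As for how such a line might be produced from raw text: the challenge's actual procedure is not described in this thread, so the following is only one plausible route. The layout looks like a sparse bag-of-words count vector, similar to the LIBSVM/SVMlight text format. The sketch below assumes lowercase whitespace tokenization and ids assigned in order of first appearance. `build_vocabulary` and `encode` are hypothetical helper names, and the two sample documents are invented for illustration.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(label, text, vocab):
    """Write one document as 'label id:count id:count ...' with ids ascending."""
    counts = Counter(vocab[t] for t in text.lower().split() if t in vocab)
    pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {pairs}"


# Invented toy corpus: (label, text) pairs.
docs = [(5, "the cat sat on the mat"), (2, "the dog barked")]
vocab = build_vocabulary(text for _, text in docs)
lines = [encode(label, text, vocab) for label, text in docs]
print(lines[0])  # 5 0:2 1:1 2:1 3:1 4:1
print(lines[1])  # 2 0:1 5:1 6:1
```

A real pipeline would normally add better tokenization, stop-word removal or stemming; none of that is shown here.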
