I am doing a project on clinical text classification. In my corpus, the data are already labelled with codes (for example: 768.2, V13.02, V13.09, 599.0, ...). I have already separated the text from the labels and applied word embeddings to the text, which I am going to feed into a convolutional neural network. However, the labels still need to be encoded. I have read examples of sentiment text classification and MNIST, but they all use integers to classify their data; my labels are in text form, which is why I cannot use one-hot encoding the way they do. Could anyone suggest a way to do it? Thanks
- You can use one-hot encoding for discrete labels. For example, for the labels "Yes", "No" and "Maybe" you can assign "No=0", "Yes=1", "Maybe=2" and then encode them into multiple binary/continuous labels. – Mephy Aug 02 '16 at 02:48
- Thanks Mephy, my text data is classified into 45 labels. Some of the texts may have two labels at once. – ngoduyvu Aug 02 '16 at 04:25
2 Answers
Discrete text labels are easily converted to discrete numeric data by creating an enumeration mapping. For example, assuming the labels "Yes", "No" and "Maybe":
No -> 0
Yes -> 1
Maybe -> 2
And now you have numeric data, which can later be converted back (as long as the algorithm treats those as discrete values and does not return 0.5 or something like that).
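As a minimal sketch of such an enumeration mapping in plain Python (the labels are the toy "Yes"/"No"/"Maybe" ones above, not your ICD-style codes):
label_to_int = {'No': 0, 'Yes': 1, 'Maybe': 2}
int_to_label = {i: label for label, i in label_to_int.items()}

labels = ['No', 'Yes', 'Maybe', 'Yes']           # example string labels
encoded = [label_to_int[l] for l in labels]      # [0, 1, 2, 1]
decoded = [int_to_label[i] for i in encoded]     # back to the original strings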
In the case where each instance can have multiple labels, as you said in a comment, you can create the encoding by putting each label in its own column ("one-hot encoding"). Even if some software does not implement that off the shelf, it is not hard to do by hand.
Here's a very simple (and, to be honest, not well-written) example using pandas' get_dummies function:
import numpy as np
import pandas as pd

labels = np.array(['a', 'b', 'a', 'c', 'ab', 'a', 'ac'])
df = pd.DataFrame(labels, columns=['label'])

# get_dummies treats the combined labels 'ab' and 'ac' as categories of
# their own, so fold those columns back into the single-label columns.
ndf = pd.get_dummies(df)
ndf.label_a = ndf.label_a + ndf.label_ab + ndf.label_ac
ndf.label_b = ndf.label_b + ndf.label_ab
ndf.label_c = ndf.label_c + ndf.label_ac
ndf = ndf.drop(['label_ab', 'label_ac'], axis=1)
ndf
   label_a  label_b  label_c
0      1.0      0.0      0.0
1      0.0      1.0      0.0
2      1.0      0.0      0.0
3      0.0      0.0      1.0
4      1.0      1.0      0.0
5      1.0      0.0      0.0
6      1.0      0.0      1.0
You can now train a multivariate model to output the values of label_a, label_b and label_c, and then reconstruct the original labels like "ab". Just make sure the outputs are in the range [0, 1] (by applying a softmax layer or something like that).
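A small sketch of that reconstruction step, assuming the model emits one probability per label column (the 0.5 threshold and the example values are arbitrary):
import numpy as np

# Hypothetical per-label probabilities from the model, one row per instance.
predictions = np.array([[0.9, 0.1, 0.2],
                        [0.8, 0.7, 0.1]])
columns = ['a', 'b', 'c']

# Threshold each column and join the active labels back into strings like 'ab'.
reconstructed = [''.join(c for c, p in zip(columns, row) if p >= 0.5)
                 for row in predictions]
print(reconstructed)   # ['a', 'ab']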

- Thanks Mephy, I got your idea. Could I use get_dummies to convert the categorical labels into integer numbers and then feed them into one-hot encoding? – ngoduyvu Aug 02 '16 at 11:44
- Hey Mephy, I got your idea, but my project is implemented in TensorFlow. The neural network only takes tensors, not arrays. Do you know how I could convert these categorical data into numerical data with a result similar to yours in TensorFlow? – ngoduyvu Aug 06 '16 at 06:29
Watch this 4-minute video (Coursera: ML Classification (University of Washington) -> Week 1 -> Encoding Categorical Inputs): https://www.coursera.org/learn/ml-classification/lecture/kCY0D/encoding-categorical-inputs
There are two methods of encoding:
One-hot encoding
Bag of words (I think this is the more suitable method in this case)
The diagram in the linked lecture describes how the bag-of-words method works. A text can contain, say, 10,000 different words, or more, many more, millions. What bag of words does is take that text and encode it as word counts.
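As a rough sketch of that counting step (this uses scikit-learn's CountVectorizer, which is my choice here and not part of the lecture; a plain collections.Counter would also work):
from sklearn.feature_extraction.text import CountVectorizer

# Two toy clinical-style sentences; the vocabulary is learned from the texts themselves.
texts = ["patient reports chest pain", "patient denies chest pain or fever"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)      # sparse matrix of word counts

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(counts.toarray())                       # one row of counts per text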
Edit 1
Python Implementation: Visit http://www.python-course.eu/text_classification_python.php

- Thanks Sayali, I got your idea: I have 46 labels, so I create an array of 46 and, when a text has a given label, I set that position to 1, such as (00100000..0). I really don't know how to write the code for this; do you know any function in Python that does that? – ngoduyvu Aug 02 '16 at 10:34
- https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words – Sayali Sonawane Aug 02 '16 at 10:37
- @ngoduyvu Visit http://www.python-course.eu/text_classification_python.php for the Python implementation part. – Sayali Sonawane Aug 02 '16 at 11:16
- Hi Sayali, your website is a great help. However, I already understand the word pre-training step; now I have trouble with encoding the labels. In your Kaggle example, they could encode a positive or negative sentiment review as "01" or "10". But in my data I have different codes corresponding to different kinds of diseases (in string form), and to feed them into a machine learning algorithm I have to encode them as numbers. – ngoduyvu Aug 02 '16 at 11:56