I have been trying to learn how to train a machine learning model on data that contains string values. As far as I understand, string columns can be converted to a categorical type, but I have not been able to do this with LabelEncoder. I have also heard that we should not simply map the strings to numbers, because the predictions will then be wrong.

Here is an example of the data:

LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y
LP001011,Male,Yes,2,Graduate,Yes,5417,4196,267,360,1,Urban,Y

As you can see, gender (2), married (3), dependents (4), education (5), self_employed (6), Property_area (11) and loan_status (12) are strings.

Some of the columns have missing data, so I am unable to use OneHotEncoder. Error: unordered types str() > int()

I want to convert these columns to a categorical type and use the data to train a KNN model. I am using Python 3.6.

  • Maybe you need [LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) – Vivek Kumar Jul 27 '17 at 10:27

1 Answer

What you want to do is one-hot encoding. scikit-learn provides an encoder for that:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Coding thermodynamist
  • I tried it, but I get this error: unordered types str() > int() – Sriram Arvind Lakshmanakumar Jul 26 '17 at 11:25
  • You need to clean your data as a preprocessing step, unless you write your own function to handle it. Either way, cleaning your data is a standard step when you implement machine learning algorithms. You can remove either the feature (the column) or the entry (the row). You can also fill in a value for the missing entries: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer – Coding thermodynamist Jul 26 '17 at 11:35
  • When you have missing data, you can also replace it with the value "missing", or with a specific number that is easy to spot, such as -9999. The one-hot encoding will then work, and you will get a separate category for missing data. – Coding thermodynamist Jul 26 '17 at 12:42
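
The placeholder idea from the last comment can be sketched without any third-party library, just to show the mechanics. This is a minimal illustration, not the answerer's code: the column labels (`Gender`, `Married`, `Education`, `Property_Area`), their 0-based positions, and the row `LP009999` with a blank gender are assumptions made up for the example. In practice a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would normally do this step.

```python
import csv
import io

# Two rows from the question plus one hypothetical row (LP009999) with a
# blank second field, added only to show how a missing value is handled.
RAW = """\
LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
LP009999,,Yes,0,Not Graduate,No,3000,0,66,360,1,Urban,Y
"""

# 0-based positions of some string columns and labels for them.
# Both are assumptions read off the sample rows, not given in the question.
CATEGORICAL = {1: "Gender", 2: "Married", 4: "Education", 11: "Property_Area"}


def one_hot(rows, columns, placeholder="missing"):
    """Return (feature_names, encoded_rows) for the chosen columns."""
    # Replace blanks first so every cell holds a non-empty string.
    cleaned = [
        {idx: (row[idx].strip() or placeholder) for idx in columns}
        for row in rows
    ]
    # One output feature per (column, category) pair, in sorted order.
    names = []
    for idx, label in columns.items():
        for category in sorted({r[idx] for r in cleaned}):
            names.append((idx, category, f"{label}_{category}"))
    # Each row becomes a 0/1 vector with exactly one 1 per encoded column.
    encoded = [
        [1 if r[idx] == category else 0 for idx, category, _ in names]
        for r in cleaned
    ]
    return [name for _, _, name in names], encoded


rows = list(csv.reader(io.StringIO(RAW)))
feature_names, encoded = one_hot(rows, CATEGORICAL)
print(feature_names)
for vector in encoded:
    print(vector)
```

The blank gender ends up in its own `Gender_missing` feature instead of breaking the encoder. The resulting 0/1 columns can be placed alongside the numeric columns and passed to a KNN classifier; since KNN is distance-based, scaling the numeric columns first is usually worth considering.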