I'd like to transform columns filled with strings into categorical variables so that I could run statistics. However, I am having difficulty with this transformation because I'm fairly new to Python.

Here is a sample of my code:

# Open txt file and provide column names
data = pd.read_csv('sample.txt', sep="\t", header = None,
                   names = ["Label", "I1", "I2", "C1", "C2"])
# Convert I1 and I2 to continuous, numeric variables
data = data.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Convert Label, C1, and C2 to categorical variables
data["Label"] = pd.factorize(data.Label)[0]
data["C1"] = pd.factorize(data.C1)[0]
data["C2"] = pd.factorize(data.C2)[0]

# Split the predictors into training/testing sets
predictors = data.drop('Label', 1)
msk = np.random.rand(len(predictors)) < 0.8
predictors_train = predictors[msk]
predictors_test = predictors[~msk]

# Split the response variable into training/testing sets
response = data['Label']
ksm = np.random.rand(len(response)) < 0.8
response_train = response[ksm]
response_test = response[~ksm]

# Logistic Regression
from sklearn import linear_model
# Create logistic regression object
lr = linear_model.LogisticRegression()

# Train the model using the training sets
lr.fit(predictors_train, response_train)

However, I get this error:

ValueError: could not convert string to float: 'ec26ad35'

The ec26ad35 value is a string from the categorical variables C1 and C2. I'm not sure what's going on: Didn't I already convert the strings into categorical variables? Why does the error say that they're still strings?

Using data.head(30), this is my data:

>> data[["Label", "I1", "I2", "C1", "C2"]].head(30)
    Label   I1   I2        C1        C2
0       0  1.0    1  68fd1e64  80e26c9b
1       0  2.0    0  68fd1e64  f0cf0024
2       0  2.0    0  287e684f  0a519c5c
3       0  NaN  893  68fd1e64  2c16a946
4       0  3.0   -1  8cf07265  ae46a29d
5       0  NaN   -1  05db9164  6c9c9cf3
6       0  NaN    1  439a44a4  ad4527a2
7       1  1.0    4  68fd1e64  2c16a946
8       0  NaN   44  05db9164  d833535f
9       0  NaN   35  05db9164  510b40a5
10      0  NaN    2  05db9164  0468d672
11      0  0.0    6  05db9164  9b5fd12f
12      1  0.0   -1  241546e0  38a947a1
13      1  NaN    2  be589b51  287130e0
14      0  0.0   51  5a9ed9b0  80e26c9b
15      0  NaN    2  05db9164  bc6e3dc1
16      1  1.0  987  68fd1e64  38d50e09
17      0  0.0    1  8cf07265  7cd19acc
18      0  0.0   24  05db9164  f0cf0024
19      0  7.0  102  3c9d8785  b0660259
20      1  NaN   47  1464facd  38a947a1
21      0  0.0    1  05db9164  09e68b86
22      0  NaN    0  05db9164  38a947a1
23      0  NaN    9  05db9164  08d6d899
24      0  0.0    1  5a9ed9b0  3df44d94
25      0  NaN    4  5a9ed9b0  09e68b86
26      1  0.0    1  8cf07265  942f9a8d
27      1  0.0   20  68fd1e64  38a947a1
28      1  0.0   78  68fd1e64  1287a654
29      1  3.0    0  05db9164  90081f33

Edit: Included the error I get when imputing missing data after splitting the dataframe into training and testing sets. I'm not sure what's going on here either.

# Impute missing data
>> from sklearn.preprocessing import Imputer
>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>> predictors_train = imp.fit_transform(predictors_train)
TypeError: float() argument must be a string or a number, not 'function'
Lightness Races in Orbit
  • I don't know what that variable is, but for categorical variables you need to use [dummy variables](http://stackoverflow.com/a/37144372/2285236) in linear regression. – ayhan Jul 30 '16 at 19:31
  • If you post a sample from your dataframe I can suggest a pandas solution for that as well. – ayhan Jul 30 '16 at 19:32
  • What is your dependent variable? Is it label? If so, is it a numerical variable (it should be if you are going to use linear regression). – ayhan Jul 30 '16 at 21:12
  • @ayhan you're right. I'll make the adjustment. – Provisional.Modulation Jul 30 '16 at 21:15

1 Answer

As @ayhan noted in the comments, you probably want to use dummy variables here. Integer codes such as those from pd.factorize imply an ordering between values, and it seems highly unlikely from your data that there is really any ordering in your text labels.

This can easily be done via pandas.get_dummies, e.g.:

pd.get_dummies(df.C1)

Note that this returns a regular DataFrame:

>>> pd.get_dummies(df.C1).columns
Index([u'05db9164', u'1464facd', u'241546e0', u'287e684f', u'3c9d8785',
     u'439a44a4', u'5a9ed9b0', u'68fd1e64', u'8cf07265', u'be589b51'],
     dtype='object')

You'd therefore probably want to combine this with the rest of your data via a horizontal concat (pd.concat with axis=1).
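As a rough sketch of that horizontal concat: the small frame below is a hypothetical stand-in built from four rows of the question's sample, and the names `dummies` and `encoded` are only illustrative. It assumes both C1 and C2 should be expanded into indicator columns.

```python
import pandas as pd

# Hypothetical toy frame: rows 0, 1, 2 and 7 of the sample shown in the question.
df = pd.DataFrame({
    "Label": [0, 0, 0, 1],
    "I1": [1.0, 2.0, 2.0, 1.0],
    "I2": [1, 0, 0, 4],
    "C1": ["68fd1e64", "68fd1e64", "287e684f", "68fd1e64"],
    "C2": ["80e26c9b", "f0cf0024", "0a519c5c", "2c16a946"],
})

# One indicator column per distinct value of C1 and of C2; the prefixes keep
# the two groups of columns distinguishable.
dummies = pd.get_dummies(df[["C1", "C2"]], prefix=["C1", "C2"])

# Drop the original string columns and attach the indicators side by side.
encoded = pd.concat([df.drop(columns=["C1", "C2"]), dummies], axis=1)

print(encoded.columns.tolist())
```

After this, `encoded` contains only numeric or boolean columns, so no raw strings such as 'ec26ad35' remain. Missing values in I1 would still need to be handled separately.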


If you are actually looking to transform the labels into something numeric (which does not seem likely), you might look at sklearn.preprocessing.LabelEncoder.
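For completeness, a minimal sketch of what LabelEncoder does, using a hypothetical list of C1 values taken from the question's sample. Note that scikit-learn documents this class as intended for encoding target values, and the integers it assigns simply follow the sorted order of the strings, so they carry no real meaning as magnitudes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of C1 values from the question's data.
c1 = ["68fd1e64", "68fd1e64", "287e684f", "8cf07265"]

le = LabelEncoder()
codes = le.fit_transform(c1)            # one integer per distinct string
original = le.inverse_transform(codes)  # maps the integers back to the strings

print(list(le.classes_))  # distinct values, in sorted order
print(list(codes))
```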

Ami Tavory