Hi, I have seen some machine learning examples that call as_matrix on a DataFrame and pass the result to a machine learning algorithm. Is it OK to use the tuples returned by .as_matrix as inputs to machine learning algorithms, as in the code below? Thanks.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

trainArr_All = df.as_matrix(cols_attr)  # training features
trainRes_All = df.as_matrix(col_class)  # training labels
# hold out 20% of the rows for testing
trainArr, x_test, trainRes, y_test = train_test_split(trainArr_All, trainRes_All, test_size=0.20, random_state=42)
rf = RandomForestClassifier(n_estimators=20, criterion='gini', random_state=42)  # 20 decision trees
y_score = rf.fit(trainArr, trainRes.ravel()).predict(x_test)  # ravel() flattens the labels to 1-D
y_score = y_score.tolist()
s900n
    Tuples are *not* the output... The output is a `numpy.ndarray`. I don't really see this method very often, and generally I would use `df.values` which achieves the same thing. – juanpa.arrivillaga May 15 '17 at 15:50
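
To illustrate the comment above, here is a minimal sketch showing what `df.values` returns. The DataFrame contents and the column names `f1`, `f2`, and `label` are made up for illustration and are not from the question. Note that `as_matrix` was deprecated in pandas 0.23 and removed in 1.0, so the sketch uses `.values` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                   "f2": [4.0, 5.0, 6.0],
                   "label": [0, 1, 0]})
cols_attr = ["f1", "f2"]
col_class = ["label"]

X = df[cols_attr].values      # 2-D feature array
y_2d = df[col_class].values   # list of columns -> 2-D array of shape (n, 1)
y = y_2d.ravel()              # flatten to the 1-D shape that fit() expects

print(type(X).__name__, X.shape)  # ndarray (3, 2)
print(y_2d.shape, y.shape)        # (3, 1) (3,)
```

Selecting with a list of columns yields a 2-D array even for a single column, which is why the question's code calls `ravel()` on the labels.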

1 Answer

Pandas as_matrix converts the DataFrame to a numpy.ndarray (documentation), NOT a tuple! sklearn expects its inputs to be NumPy arrays. For example, the RandomForestClassifier documentation notes that the input is internally converted to dtype=np.float32, and that a sparse matrix is converted to a sparse csc_matrix. Although passing a pandas DataFrame directly is usually fine with a stable version of sklearn (it is converted internally), you may occasionally run into problems due to data type incompatibility. It is usually safer to call as_matrix and convert the DataFrame to a NumPy array before passing it to sklearn.
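
As a rough check of the point about internal conversion, the sketch below fits the same classifier once on a DataFrame and once on the equivalent NumPy array, then compares the predictions. The synthetic data and the column names `a`, `b`, `c`, and `label` are assumptions for illustration only. It uses `.values` because `as_matrix` is no longer available in recent pandas releases.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic numeric data; names and sizes are hypothetical.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(60, 3), columns=["a", "b", "c"])
df["label"] = (df["a"] + df["b"] > 1.0).astype(int)

X_df = df[["a", "b", "c"]]   # DataFrame input
X_np = X_df.values           # explicit NumPy array input
y = df["label"].values

# Same hyperparameters and seed for both models.
rf_df = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_df, y)
rf_np = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_np, y)

pred_df = rf_df.predict(X_df)
pred_np = rf_np.predict(X_np)
same = np.array_equal(pred_df, pred_np)
print(same)
```

With purely numeric columns like these, both inputs should behave the same way. The explicit conversion mainly helps when the DataFrame has mixed or object dtypes, since it surfaces the problem before the data reaches sklearn.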

Here is an example of someone running into a problem with a pandas DataFrame: Using slices in Python

MhFarahani