I have been trying to use DataFrameMapper to apply several pre-processing transformations to my dataframe as a step in my scikit-learn Pipeline.

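import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, Normalizer
from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']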

df = pd.read_csv(url, names=names)

mapper = DataFrameMapper(
    [('Height', Normalizer()), ('Sex', LabelBinarizer())]
)

stages = []

stages += [("mapper", mapper)]

estimator = DecisionTreeClassifier()

stages += [("dtree", estimator)]

pipeline = Pipeline(stages)

labelCol = 'Rings'
target = df[labelCol]
data = df.drop(labelCol, axis=1)

train_data, test_data, train_target, expected = train_test_split(data, target, test_size=0.25, random_state=33)

model = pipeline.fit(train_data, train_target)

However, I am getting the following error:

Traceback (most recent call last):
  File "app/experimenter/sklearn/transformations.py", line 65, in <module>
    model = pipeline.fit(train_data, train_target)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/Library/Python/2.7/site-packages/sklearn/base.py", line 497, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/Library/Python/2.7/site-packages/sklearn_pandas/dataframe_mapper.py", line 225, in transform
    stacked = np.hstack(extracted)
  File "/Library/Python/2.7/site-packages/numpy/core/shape_base.py", line 288, in hstack
    return _nx.concatenate(arrs, 1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

What am I missing?

Thanks :)

Larissa Leite
  • On which line does this error occur? Please post the full stack trace. – Vivek Kumar May 21 '17 at 03:48
  • @VivekKumar updated the question – Larissa Leite May 21 '17 at 09:55
  • This error arises from the use of Normalizer. What do you expect its output to be? I mean, why are you using it? To normalize the values of the 'Height' column? If that's the case, then StandardScaler should be used; Normalizer scales the samples (rows), not the columns as you intend. – Vivek Kumar May 22 '17 at 05:06
  • I see that you have accepted an answer. What actually worked for you? – Vivek Kumar May 23 '17 at 05:06
  • Hi @VivekKumar, I was in fact confusing specifying the column as a string with specifying it as a list, so sorting this out did the trick. The dataset here was only an example (I need to generalize the code to create this DataFrameMapper dynamically, so it will depend on the user's input); I wasn't really focusing on the transformations for this dataset themselves. – Larissa Leite May 26 '17 at 10:09
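
The distinction raised in the comments, scaling samples versus scaling columns, can be sketched with plain NumPy. This is only an illustration of the arithmetic that Normalizer (with its default L2 norm) and StandardScaler perform, not their actual implementation; the array `X` and all variable names are made up for the example.

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 features (columns).
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Row-wise L2 normalization, the kind of scaling Normalizer applies:
# every sample is divided by its own norm, so each row ends up with length 1.
X_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

# Column-wise standardization, the kind of scaling StandardScaler applies:
# every feature is shifted to zero mean and scaled to unit variance.
X_cols = (X - X.mean(axis=0)) / X.std(axis=0)

# Applied to a single column, row-wise normalization divides each value
# by its own absolute value, so every nonzero entry becomes 1.0.
single_col = X[:, [1]]
single_col_rows = single_col / np.linalg.norm(single_col, axis=1, keepdims=True)
```

The last step suggests why row-wise normalization of a single column such as 'Height' is rarely what is wanted: the column loses all of its information.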

1 Answer

You will have to alter the construction of the DataFrameMapper so that the 'Height' column is selected with a list (['Height']) rather than a plain string:

mapper = DataFrameMapper(
    [(['Height'], Normalizer()), ('Sex', LabelBinarizer())]
)

This is a subtle detail which can be found in the documentation of sklearn_pandas:

Map the Columns to Transformations

The difference between specifying the column selector as 'column' (as a simple string) and ['column'] (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one dimensional array will be passed, while in the second case it will be a 2-dimensional array with one column, i.e. a column vector.

[...]

Be aware that some transformers expect a 1-dimensional input (the label-oriented ones) while some others, like OneHotEncoder or Imputer, expect 2-dimensional input, with the shape [n_samples, n_features].
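
A minimal NumPy-only sketch of why the shapes matter when the mapper stacks its outputs. It does not call sklearn or sklearn_pandas; the height values and the one-hot matrix are made up, and the `(1, n)` array is an assumption about how the sklearn version in the traceback treated 1-dimensional input (as a single sample), which would explain the `hstack` failure.

```python
import numpy as np

# Hypothetical values standing in for the 'Height' column (4 samples).
height = np.array([0.095, 0.090, 0.135, 0.125])

# Hypothetical LabelBinarizer-style output for 'Sex': shape (4, 3).
sex_onehot = np.array([[0, 0, 1],
                       [0, 0, 1],
                       [1, 0, 0],
                       [0, 0, 1]])

as_1d = height                            # 'Height'   -> shape (4,)
as_column = height.reshape(-1, 1)         # ['Height'] -> shape (4, 1)
as_single_sample = height.reshape(1, -1)  # 1-D input read as one sample: (1, 4)

# A column vector has one row per sample, so it stacks next to the one-hot block.
stacked = np.hstack([as_column, sex_onehot])  # shape (4, 4)

# A (1, 4) array has a different number of rows than (4, 3), so hstack fails.
try:
    np.hstack([as_single_sample, sex_onehot])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

Selecting the column as `['Height']` keeps one row per sample, so the transformed block can be concatenated with the other columns.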

Jan Trienes