
I'm trying to use LinearDiscriminantAnalysis from sklearn. Here is what I've done so far:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import preprocessing
import numpy as np
import pandas as pd
# Reading csv file
training_file = 'Training.csv'
testing_file  = 'Test.csv'
dataframe_train = pd.read_csv(training_file)
dataframe_test  = pd.read_csv(testing_file)
dataframe_train['onehot_code'] = dataframe_train.apply(lambda x : onehot_processing(int(float(x['Onehot'])), numberOFclasses), axis=1)
dataframe_test['onehot_code'] = dataframe_test.apply(lambda x:onehot_processing(int(float(x['Onehot'])),numberOFclasses),axis=1)
stdsc = preprocessing.StandardScaler()
np_scaled_train = stdsc.fit_transform(dataframe_train.iloc[:,:-3])
np_scaled_test  = stdsc.transform(dataframe_test.iloc[:,:-3])

lda = LinearDiscriminantAnalysis(n_components=2)
Training_Frame = lda.fit_transform(np_scaled_train,dataframe_train.iloc[:,-1])  # the script crashes here 

Testing_Frame  = lda.transform(np_scaled_test)

The error message that I get is:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

The shapes of the dataframes are correct, so I don't understand what I'm missing or what I should convert so that the function accepts the parameter. Or is the cause something else?

I'll be grateful for any hint!

Update

Here's what `dataframe_train.iloc[:,-1]` looks like:

0       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
10      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
11      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
12      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
13      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
14      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
15      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
16      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
18      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
19      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
20      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
21      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
22      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
23      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
24      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
25      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
26      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
27      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
28      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
29      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
                              ...                        
2328    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2329    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2330    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2331    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2332    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2333    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2334    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2335    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2336    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2337    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2338    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2339    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2340    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2341    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2342    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2343    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2344    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2345    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2346    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2347    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2348    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2349    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2350    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2351    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2352    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2353    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2354    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2355    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2356    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2357    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: onehot_code, dtype: object

Each row is a vector of 20 elements.

2nd Update

Running the following: Training_Frame = lda.fit_transform(np_scaled_train,np.asarray(dataframe_train.iloc[:,-1]))

delivers this error message:

        ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-7-a8adf693ad9e> in <module>()
    ----> 1 Training_Frame = lda.fit_transform(np_scaled_train,np.asarray(dataframe_train.iloc[:,-1]))

    c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
        495         else:
        496             # fit method of arity 2 (supervised transformation)
    --> 497             return self.fit(X, y, **fit_params).transform(X)
        498 
        499 

    c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\discriminant_analysis.py in fit(self, X, y, store_covariance, tol)
        441             self.tol = tol
        442         X, y = check_X_y(X, y, ensure_min_samples=2, estimator=self)
    --> 443         self.classes_ = unique_labels(y)
        444 
        445         if self.priors is None:  # estimate priors from sample

    c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\multiclass.py in unique_labels(*ys)
         77     # Check that we don't mix label format
         78 
    ---> 79     ys_types = set(type_of_target(x) for x in ys)
         80     if ys_types == set(["binary", "multiclass"]):
         81         ys_types = set(["multiclass"])

    c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\multiclass.py in <genexpr>(.0)
         77     # Check that we don't mix label format
         78 
    ---> 79     ys_types = set(type_of_target(x) for x in ys)
         80     if ys_types == set(["binary", "multiclass"]):
         81         ys_types = set(["multiclass"])

    c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\multiclass.py in type_of_target(y)
        248         if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
        249                 and not isinstance(y[0], string_types)):
    --> 250             raise ValueError('You appear to be using a legacy multi-label data'
        251                              ' representation. Sequence of sequences are no'
        252                              ' longer supported; use a binary array or sparse'

    ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
Engine
  • Have you deleted the old question and posted it again as a new one? Anyways need sample of data to test the code, specifically `dataframe_train.iloc[:,-1]`. – Vivek Kumar Jun 02 '17 at 07:46
  • I have a hunch that you need to process your `y` (`dataframe_train.iloc[:,-1]`) with [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html). – Vivek Kumar Jun 02 '17 at 07:51
  • @VivekKumar Thanks for your help I've updated the question ! – Engine Jun 02 '17 at 08:12
  • Each row is a vector of 20 elements. Can a row contain multiple 1's? If not, then there is no need to one-hot encode this column. If yes, then convert it to a numpy array (as you have already one-hot encoded them) before sending it to fit(), like this: `np.asarray(dataframe_train.iloc[:,-1])`. – Vivek Kumar Jun 02 '17 at 08:38
  • @VivekKumar If you mean `Training_Frame = lda.fit_transform(np_scaled_train,np.asarray(dataframe_train.iloc[:,-1]))`, it's not working: ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. – Engine Jun 02 '17 at 08:44
  • Can you post the full stack trace please? – Vivek Kumar Jun 02 '17 at 08:46

1 Answer

This is what worked for me when I tried to reproduce your example.

y_train = dataframe_train.iloc[:,-1]
y_test = dataframe_test.iloc[:,-1]

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_train  = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

lda = LinearDiscriminantAnalysis(n_components=2)
Training_Frame = lda.fit_transform(np_scaled_train, y_train)  

Testing_Frame  = lda.transform(np_scaled_test)

The error is most likely caused by how pandas stores lists in a column and how numpy interprets them. scikit-learn checks whether the supplied y is a numpy array of a supported dtype (int, float, string, etc.), but in your case df.iloc[:, -1] returns a pandas.Series of lists, which converts directly to a numpy array with dtype=object. Hence the error.
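To make the dtype issue concrete, here is a minimal sketch on made-up data (the three short rows below are an assumption standing in for your 20-element vectors). It compares a direct conversion of a Series of lists with stacking the rows into a 2-D array:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataframe_train.iloc[:, -1]: a Series whose cells are lists.
column = pd.Series([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])

# Direct conversion keeps one Python list per cell, so the result is 1-D with dtype=object.
as_object = np.asarray(column)
print(as_object.dtype, as_object.shape)

# Converting row by row yields a regular 2-D integer array.
stacked = np.array([np.array(r) for r in column])
print(stacked.dtype, stacked.shape)
```

The first array is the "sequence of sequences" that the traceback complains about; the second is an ordinary binary matrix.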

One more workaround is (without using any of the code above):

Training_Frame = lda.fit_transform(np_scaled_train,
                     np.array([np.array(r) for r in dataframe_train.iloc[:,-1]]))
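If each row contains exactly one 1 (the question raised in the comments), another option may be to collapse the one-hot vectors into integer class labels, since the scikit-learn documentation describes y for LinearDiscriminantAnalysis.fit as a 1-D array of shape (n_samples,). The sketch below is only an illustration under that assumption: the data is synthetic, and names such as `onehot_column` and `y_labels` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
n_classes, n_per_class, n_features = 3, 20, 5

# Synthetic stand-in for the scaled features: one shifted cluster per class.
X = np.vstack([rng.randn(n_per_class, n_features) + 3 * k
               for k in range(n_classes)])

# Synthetic stand-in for the 'onehot_code' column: a Series of Python lists.
class_ids = np.repeat(np.arange(n_classes), n_per_class)
onehot_column = pd.Series([np.eye(n_classes, dtype=int)[k].tolist()
                           for k in class_ids])

# Stack the lists into a 2-D array, then take the position of the 1 in each row.
# This assumes exactly one 1 per row; rows with several 1's would lose information.
y_onehot = np.array(onehot_column.tolist())
y_labels = y_onehot.argmax(axis=1)

lda = LinearDiscriminantAnalysis(n_components=2)
projected = lda.fit_transform(X, y_labels)
print(projected.shape)
```

With your data, the same two conversion lines would be applied to dataframe_train.iloc[:,-1] before calling fit_transform on np_scaled_train.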

Hope it works for you.

Vivek Kumar