I'm new to machine learning and working on a project using Python (3.6), pandas, NumPy and scikit-learn.

My DataFrame is:

discount   tax   total   subtotal   productid
  3         0     20       13        002
  10        3     106      94        003
  46.49     6     21       20        004

Here's how I have performed the classification:

df_full = pd.read_excel('input/Potential_Learning_Patterns.xlsx', sheet_name=0)
df_full.head()
#for convert to numeric
df_full['discount'] = pd.to_numeric(df_full['discount'], errors='coerce')
df_full['productdiscount'] = pd.to_numeric(df_full['discount'], errors='coerce')
df_full['Class'] = ((df_full['discount'] > 20) & 
                (df_full['tax'] == 0) &
                (df_full['productdiscount'] > 20) &
                (df_full['total'] > 100)).astype(int)
print (df_full)

# Get some sample data from entire dataset
data = df_full.sample(frac = 0.1, random_state = 1)
print(data.shape)
data.isnull().sum()
# Convert excel data into matrix
columns = "invoiceid locationid timestamp customerid discount tax total subtotal productid quantity productprice productdiscount invoice_products_id producttax invoice_payments_id paymentmethod paymentdetails amount Class(0/1) Class".split()
X = pd.DataFrame.as_matrix(data, columns=columns)
Y = data.Class
# temp = np.array(temp).reshape((len(temp), 1)
Y = Y.values.reshape(Y.shape[0], 1)
X.shape
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.06)
X_test, X_dev, Y_test, Y_dev = train_test_split(X_test, Y_test, test_size = .5)

# Check if there is Classification Values - 0/1 in training set and other set 
np.where(Y_train == 1)
np.where(Y_test == 1)
np.where(Y_dev == 1)

# Determine no of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# calculate percentages for Fraud & Valid 
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)

print('Fraud Cases : {}'.format(len(Fraud)))
print('Valid Cases : {}'.format(len(Valid)))

# Correlation matrix
corrmat = data.corr()
fig = plt.figure( figsize = (12, 9))

sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()

Here's how I have applied the reshaping:

# Get all the columns from dataframe
columns = data.columns.tolist()

# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"] ]

# store the variable we'll be predicting on
target = "Class"
for column in data.columns:
    if data[column].dtype == type(object):
        le = LabelEncoder()
        data[column] = le.fit_transform(data[column])
        X = data[column]
    X = data[column]        
    Y = data[target]

    # Print the shapes of X & Y
    print(X.shape)
    print(Y.shape)
    # define a random state
state = 1

# define the outlier detection method
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                       contamination=outlier_fraction,
                                       random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
    n_neighbors = 20,
    contamination = outlier_fraction)
}



# fit the model
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # run classification metrics 
    print('{}:{}').format(clf_name, n_errors)
    print(accuracy_score(Y, y_pred ))
    print(classification_report(Y, y_pred ))

The code works fine up to the point where the sample and target are reshaped. But when I call the fit method of my classifiers, it returns an error like:

ValueError: Expected 2D array, got 1D array instead: array=[1 0]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

What did I do wrong here? I have multiple features, so how can I correctly reshape my sample arrays?

Thanks in advance!

Abdul Rehman

1 Answer

In the following loop you are overwriting the variable X with a single column (a Series) in each iteration, so after the loop X is 1D, which is what the fit method complains about:

for column in data.columns:
    if data[column].dtype == type(object):
        le = LabelEncoder()
        data[column] = le.fit_transform(data[column])
        X = data[column]
    X = data[column]        #  <------- NOTE: X is replaced by one column here
    Y = data[target]

Instead, you can define X and Y once, after the loop, as follows:

X = data.drop(columns=[target])
Y = data[target]

The vast majority of scikit-learn methods accept pandas DataFrames and Series as input data sets, so there is no need to convert them to NumPy arrays first.
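Putting the pieces together, here is a minimal sketch of the corrected flow. The toy DataFrame is made up for illustration and only stands in for the spreadsheet from the question; the numeric-dtype check is my substitute for the original dtype comparison, chosen so that text columns are encoded regardless of how pandas stores them.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the spreadsheet (values are invented for illustration)
data = pd.DataFrame({
    "discount": [3.0, 10.0, 46.49, 25.0],
    "tax": [0, 3, 6, 0],
    "total": [20, 106, 21, 150],
    "productid": ["002", "003", "004", "002"],
    "Class": [0, 0, 0, 1],
})
target = "Class"

# Encode non-numeric columns in place; X and Y are NOT assigned inside the loop
for column in data.columns:
    if not pd.api.types.is_numeric_dtype(data[column]):
        data[column] = LabelEncoder().fit_transform(data[column])

# Build the feature matrix and target once, after the loop
X = data.drop(columns=[target])  # 2D: (n_samples, n_features)
Y = data[target]                 # 1D: (n_samples,)
print(X.shape, Y.shape)

# A 2D X is what fit() expects; predict() returns 1 for inliers, -1 for outliers
clf = IsolationForest(random_state=1).fit(X)
y_pred = clf.predict(X)
```

Because X keeps every feature column, no call to reshape is needed; reshape(-1, 1) would only be appropriate if the data really had a single feature.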

MaxU - stand with Ukraine
  • now it returns another error `AttributeError: 'NoneType' object has no attribute 'format'` at third line from bottom. – Abdul Rehman Apr 03 '18 at 07:40
  • @AbdulRehman, first of all it's another error, which doesn't have anything to do with the question you've asked. Replace: `print('{}:{}').format(clf_name, n_errors)` --> `print('{}:{}'.format(clf_name, n_errors))` – MaxU - stand with Ukraine Apr 03 '18 at 07:43
  • Hi @MaxU, one last thing if you can help, please! it throws another error for the last line with a warning as `/Users/abdul/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)` and error `ValueError: contamination must be in (0, 0.5]` – Abdul Rehman Apr 03 '18 at 07:47
  • @AbdulRehman, you might want to check [this question](https://stackoverflow.com/questions/34757653/why-does-scikitlearn-says-f1-score-is-ill-defined-with-fn-bigger-than-0) – MaxU - stand with Ukraine Apr 03 '18 at 07:48
  • Hi @MaxU, can you take a look at this question: https://stackoverflow.com/q/49839270/7644562 please? – Abdul Rehman Apr 15 '18 at 10:11
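Regarding the `contamination must be in (0, 0.5]` error mentioned in the comments: one likely cause (an assumption, since the sampled data is not shown) is that the 10% sample contains no rows with Class == 1, so the computed outlier fraction is 0.0. The sketch below uses a hypothetical helper, `contamination_from_labels`, which is not part of scikit-learn. It computes the share of outliers among all samples rather than the Fraud/Valid ratio used in the question, and returns None when the value falls outside the accepted range.

```python
def contamination_from_labels(labels):
    """Return the share of samples labelled 1, or None if it is outside (0, 0.5].

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    labels = list(labels)
    if not labels:
        return None
    n_outliers = sum(1 for value in labels if value == 1)
    fraction = n_outliers / len(labels)
    # The range check mirrors the error message quoted in the comments
    return fraction if 0 < fraction <= 0.5 else None


print(contamination_from_labels([0, 0, 0, 1]))
print(contamination_from_labels([0, 0, 0, 0]))
```

When the helper returns None, one option is to draw a larger or stratified sample so that both classes are present; in scikit-learn 0.20 and later, passing contamination="auto" to the estimators is another.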