Goal: Determine whether rfq_num_of_dealers is a significant predictor of a Done trade (Done = 1).
My Data:

df_Train_Test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 139025 entries, 0 to 139024
Data columns (total 2 columns):
rfq_num_of_dealers    139025 non-null float64
Done                  139025 non-null uint8
dtypes: float64(1), uint8(1)

df_Train_Test = df_Train_Test[['rfq_num_of_dealers','Done']]
df_Train_Test_GrpBy = df_Train_Test.groupby(['rfq_num_of_dealers','Done']).size().reset_index(name='Count').sort_values(['rfq_num_of_dealers','Done'])
display(df_Train_Test_GrpBy)

The rfq_num_of_dealers column ranges from 0 to 21 and the Done column is either 0 or 1. Note that every rfq_num_of_dealers value has a Done value of 0 or 1.

[Image: output of display(df_Train_Test_GrpBy), a table of counts for each (rfq_num_of_dealers, Done) pair]

Logistic regression:

from sklearn.model_selection import train_test_split

x = df_Train_Test[['rfq_num_of_dealers']]
y = df_Train_Test['Done']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

# 2. Train and fit a logistic regression model on the training set.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()               # create instance of model
logmodel.fit(x_train, y_train)                # fit model on the training data

# 3. Now predict values for the testing data.
predictions = logmodel.predict(x_test)        # predict on the test data (the model was fit on the train data)

# 4. Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

# 5. Create a confusion matrix for the model.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))  # the diagonal entries are the correct predictions

This yields the following warning:

 UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.       
'precision', 'predicted', average, warn_for)

The report and matrix are clearly wrong; note the right-hand column of the confusion matrix:

       precision    recall  f1-score   support

          0       0.92      1.00      0.96     41981
          1       0.00      0.00      0.00      3898

avg / total       0.84      0.92      0.87     45879

[[41981     0]
 [ 3898     0]]

How can this warning be raised if 'Done' is either 1 or 0 and all rows are populated (the y label)? Is there any code I can run to determine exactly which y labels cause the warning? Other outputs:

display(pd.Series(predictions).value_counts())
0    45879
dtype: int64

display(pd.Series(predictions).describe())
count    45879.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
dtype: float64

display(y_test)
71738     0
39861     0
16567     0
81750     1
88513     0
16314     0
113822    0
.         .

display(y_test.describe())
count    45879.000000
mean         0.084963
std          0.278829
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: Done, dtype: float64

display(y_test.value_counts())
0    41981
1     3898
Name: Done, dtype: int64

Could this have something to do with the fact that there are 12439 records with both rfq_num_of_dealers and Done equal to zero?

Peter Lucas

1 Answer

Precision is a ratio:

precision = tp / (tp + fp)
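
To make this concrete, here is a minimal sketch that plugs in the confusion matrix from the question; for the positive class the denominator tp + fp is literally zero:

import numpy as np

# Confusion matrix from the question (rows = actual class, columns = predicted class)
cm = np.array([[41981, 0],
               [ 3898, 0]])

tn, fp = cm[0]   # actual 0: 41981 predicted as 0, none predicted as 1
fn, tp = cm[1]   # actual 1: 3898 predicted as 0, none predicted as 1

# precision for class 1 = tp / (tp + fp) = 0 / 0, which is undefined;
# that 0/0 is exactly what triggers the UndefinedMetricWarning
print(tp + fp)   # 0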

The warning is telling you that the ratio is undefined, almost surely because the denominator is 0. That is, there are no true positives and no false positives on the test data. What the two terms have in common is that both count predicted positives.

It is very probable that your classifier is not predicting positives at all on the test data.
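
One way to confirm this (a sketch, assuming the logmodel and x_test from the question are still in scope) is to inspect the predicted probabilities for class 1; predict() applies a 0.5 threshold, so if no probability reaches 0.5 every prediction comes out as 0:

import numpy as np

# Probability of class 1 for each test row (second column of predict_proba)
proba_pos = logmodel.predict_proba(x_test)[:, 1]

print(proba_pos.max())                # if this is below 0.5, predict() never returns 1
print(int(np.sum(proba_pos >= 0.5)))  # number of rows that would be predicted as 1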

Before dividing into train and test, you might want to randomize the order of your instances (or stratify); it's possible that there's something systematic about the original order. This might or might not solve the problem, but, again, it looks like the problem is a lack of predicted positives in the test dataset.
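
A minimal sketch of the stratified split, reusing x and y from the question; stratify=y keeps the 0/1 proportions of Done the same in both splits, and shuffling (the default) randomizes the original order:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42,
    stratify=y)   # preserve the class ratio of Done in both train and test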

Ami Tavory
  • sure, will remove the split and see if it works. There are definitely 1s and 0s for all trading states – Peter Lucas Apr 16 '18 at 05:26
  • same result if the train/test split is removed – Peter Lucas Apr 16 '18 at 12:41
  • What is the output of `predictions.value_counts()` and `predictions.describe()`? – Ami Tavory Apr 16 '18 at 12:43
  • predictions.value_counts(): AttributeError: 'numpy.ndarray' object has no attribute 'value_counts' predictions.describe(): AttributeError: 'numpy.ndarray' object has no attribute 'describe' display(predictions): array([0, 0, 0, ..., 0, 0, 0], dtype=uint8) Looks like logmodel.predict is failing – Peter Lucas Apr 16 '18 at 22:05
  • @PeterLucas OK, then `pd.Series(predictions).value_counts()` and `pd.Series(predictions).describe()`, and same for `y_test` - if you post the outcome as an edit to your question, I think it should be possible to debug it. – Ami Tavory Apr 17 '18 at 08:35
  • Hey Ami, outputs have been added. Prediction is all zero by the looks of it. – Peter Lucas Apr 17 '18 at 09:22
  • So it's the other way around: all zeros, and no ones. It looks like this dataset is too skewed for logistic regression. I'd suggest a different method (NaiveBayes is a good start, xgboost.XGBClassifier if you can install xgboost; see the sketch after these comments). – Ami Tavory Apr 17 '18 at 11:20
  • @AmiTavory I understand this warning is raised because some classes in the test set never get predicted. What I don't understand is why a classification algorithm may behave this way. For example, I fitted `SVC` and `RF` classifiers on the same data, and `RF` doesn't predict a particular class in all runs of the experiments. – arilwan Jul 06 '22 at 11:21
  • @arilwan That's a great point, but really a question in itself. Why don't you post it as one? I'd also suggest a different site for it: data science stack exchange, or cross validated. It's less of a programming question. – Ami Tavory Jul 07 '22 at 05:13
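
A minimal sketch of the swap suggested in the comments above (GaussianNB as a drop-in replacement for LogisticRegression, reusing the train/test data from the question; whether it actually predicts any 1s on data this skewed is not guaranteed):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

# Same train/test data as in the question, different classifier
nb_model = GaussianNB()
nb_model.fit(x_train, y_train)
nb_predictions = nb_model.predict(x_test)

print(confusion_matrix(y_test, nb_predictions))
print(classification_report(y_test, nb_predictions))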