
I was trying to use Dask for the Kaggle fraud detection classification problem. But when I build the model, it predicts all the values as 1.

I am truly surprised: there are 56,607 zeros and only 92 ones in the test data, yet the model somehow predicts all values as ones.

I am obviously doing something wrong. How do I use the model correctly?

MWE

import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask_ml
from dask_ml.xgboost import XGBClassifier
import collections
from dask_ml.model_selection import train_test_split
from dask.distributed import Client

# set up cluster
client = Client(n_workers=4)

# load the data
ifile = "https://github.com/vermaji333/MLProject/blob/master/creditcard.zip?raw=true"
#!wget https://github.com/vermaji333/MLProject/blob/master/creditcard.zip?raw=true
#ifile = 'creditcard.zip'
ddf = dd.read_csv(ifile,compression='zip',
                  blocksize=None,
                  assume_missing=True)

# train-test split
target = 'Class'

Xtr, Xtx, ytr, ytx = train_test_split(
    ddf.drop(target,axis=1), 
    ddf[target],
    test_size=0.2, 
    random_state=100,
    shuffle=True
)

# modelling
model = XGBClassifier(n_jobs=-1,
                      random_state=100,
                      scale_pos_weight=1, # default
                      objective='binary:logistic')
model.fit(Xtr,ytr)
ypreds = model.predict(Xtx)
ytx = ytx.compute()
ypreds = ypreds.compute()

# model evaluation
print(collections.Counter(ytx)) # Counter({0.0: 56607, 1.0: 92})
print(collections.Counter(ypreds)) # this gives all 1's
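One diagnostic (my own suggestion, not from the post above) is to look at the predicted probabilities rather than the hard labels, since `predict` effectively thresholds `predict_proba` at 0.5 for binary classification; if every probability sits just above 0.5, that points at the model/weighting rather than the evaluation code. A minimal sketch with a stand-in scikit-learn classifier on synthetic rare-positive data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with rare positives (~2% of labels are 1)
rng = np.random.default_rng(100)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 2).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # probability of class 1
preds = (proba > 0.5).astype(int)    # what predict() returns for binary problems
print(preds.sum(), round(proba.mean(), 4))
```

If `proba` is clustered near 1.0 for every row in the real problem, the fitted model itself is degenerate; if it is spread out, the thresholding or label handling is the suspect.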

Update

I tried various values of scale_pos_weight. The training labels are:

collections.Counter(ytr)
Counter({0.0: 227708, 1.0: 400})

scale_pos_weight = 227708/400
scale_pos_weight = 400/227708
scale_pos_weight = other values

But for all of these values, I got all 1's as the result:

print(collections.Counter(ytx)) # Counter({0.0: 56607, 1.0: 92})
print(collections.Counter(ypreds)) # this gives all 1's
Counter({0.0: 56607, 1.0: 92})
Counter({1: 56699})
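For reference, the first value tried above is the conventional heuristic (count of negatives divided by count of positives, computed from the training labels); a minimal sketch using the label counts shown above:

```python
import collections

# Training-label counts copied from the Counter output above
counts = collections.Counter({0.0: 227708, 1.0: 400})

# Conventional heuristic: scale_pos_weight = n_negative / n_positive
spw = counts[0.0] / counts[1.0]
print(round(spw, 2))  # 569.27
```

As the comments below note, this rule of thumb can behave poorly on severely imbalanced data, so it is better treated as a starting point for tuning than as a fixed setting.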
BhishanPoudel
  • why `scale_pos_weight=700`. This is the reason. The rule of thumb `number of negative/ number of positive` does not work well for severely disbalanced data sets. Perhaps treat scale_pos_weight as a parameter to be tuned. – missuse Oct 31 '20 at 16:34
  • I removed the scale_pos_weight parameter; still I got all the outputs to be 1. – BhishanPoudel Oct 31 '20 at 16:56
  • I recommend starting with plain XGBoost, and only moving onto Dask after you've gotten things running well on a single machine. That should help you to isolate the problem a bit. – MRocklin Nov 01 '20 at 04:01

0 Answers