
I have a binary classification problem with around 15 features, which I selected using another model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a Bayesian logistic regression model that can be retrained nightly or over the weekend.

Currently, I have divided the data into 15 parts. I train my model on the first part and test on the last part, then I update my priors using PyMC3's Interpolated distribution and rerun the model on the second part, and so on. I check accuracy and other metrics (ROC AUC, F1-score) after each run.
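The prior-updating step looks roughly like the sketch below (a minimal version following the pattern in the PyMC3 "updating priors" example; `data_parts` is a placeholder for my 15 chunks, and since `pm.Interpolated` only handles one-dimensional distributions, I build one prior per coefficient):

```python
import numpy as np
import pymc3 as pm
from scipy import stats

def from_posterior(name, samples):
    # Turn 1-D posterior samples into an Interpolated prior via a KDE.
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin - 3 * width, smax + 3 * width, 100)
    y = stats.gaussian_kde(samples)(x)
    return pm.Interpolated(name, x, y)

trace = None
for X_part, y_part in data_parts:  # placeholder iterable of data chunks
    with pm.Model():
        if trace is None:
            # First chunk: weakly informative priors.
            coefs = [pm.Normal(f"b{i}", mu=0, sd=10) for i in range(15)]
            intercept = pm.Normal("intercept", mu=0, sd=10)
        else:
            # Later chunks: priors interpolated from the previous posterior.
            coefs = [from_posterior(f"b{i}", trace[f"b{i}"]) for i in range(15)]
            intercept = from_posterior("intercept", trace["intercept"])
        logits = intercept + sum(c * X_part[:, i] for i, c in enumerate(coefs))
        pm.Bernoulli("y", logit_p=logits, observed=y_part)
        trace = pm.sample(1000)
```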

Problems:

  1. My score is not improving.
  2. Am I using the right approach?
  3. This process is taking too much time.

If someone can guide me toward the right approach, ideally with code snippets, that would be very helpful.

  • With that much data I'd at least look into trying a variational inference approach. I fit a mixed effects regression with millions of records without too much trouble using [the ADVI stuff](https://docs.pymc.io/notebooks/variational_api_quickstart.html). Unfortunately, I'm not familiar with best practices for building updatable VI models, so I can't help you there. Also, [CrossValidated](https://stats.stackexchange.com) might be a better resource to answer this. – merv May 01 '19 at 02:50
  • Yes, VI is faster, but it is less accurate than sampling. I am not concerned about the time if my approach is right and the score is good. – Ashok Rayal May 01 '19 at 05:52

1 Answer


You can use variational inference: it is much faster than sampling and usually produces very similar results. PyMC3 itself provides methods for VI (e.g. ADVI), which are worth exploring.
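If it helps, here is a minimal sketch of what that could look like with minibatch ADVI (I'm assuming placeholder arrays `X_train`/`y_train`; the batch size and iteration count are illustrative, not tuned):

```python
import pymc3 as pm

# X_train: (n, 15) float array, y_train: (n,) 0/1 array -- placeholders
n, k = X_train.shape

# Minibatches let ADVI stream through the ~6M records.
Xb = pm.Minibatch(X_train, batch_size=500)
yb = pm.Minibatch(y_train, batch_size=500)

with pm.Model():
    coefs = pm.Normal("coefs", mu=0, sd=10, shape=k)
    intercept = pm.Normal("intercept", mu=0, sd=10)
    logits = intercept + pm.math.dot(Xb, coefs)
    # total_size rescales the minibatch likelihood to the full dataset.
    pm.Bernoulli("y", logit_p=logits, observed=yb, total_size=n)

    approx = pm.fit(n=50_000, method="advi")  # mean-field ADVI
    trace = approx.sample(2_000)              # draw posterior samples
```

With `total_size` set, the minibatch likelihood is rescaled to the full dataset, so each update only needs a small batch in memory.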

I can only answer this part of the question. If you elaborate on your problem a bit further, maybe I can help more.