Can I apply "classification" first and then "regression" to the same data set?

Question

I am a beginner in data science and need help with a topic.

I have a data set about the customers of an institution. My goal is to first find out which customers will pay this institution, and then how much money the paying customers will pay.

In this context, I think I can first find out which customers will pay by "classification", and then how much they will pay by "regression".

So, first I want to apply "classification" and then apply "regression" to this output. How can I do that?

Marc · Answer 1 · 2022-01-21T03:00:24.830

2

Sure, you can definitely apply a classification method followed by regression analysis. This is actually a common pattern during exploratory data analysis.

For your use case, based on the basic info you are sharing, I would intuitively go for 1) logistic regression and 2) multiple linear regression.

Logistic regression is actually a classification tool, even though the name suggests otherwise. In a binary logistic regression model, the dependent variable has two levels (categorical), which is what you need to predict whether your customers will pay or will not pay (a binary decision).

The multiple linear regression, applied to the same independent variables from your available dataset, will then provide you with a linear model to predict how much your customers will pay (i.e. the output of the inference will be a continuous variable: the actual expected dollar value).

That is the approach I would recommend, since you are new to this field. There are obviously many other ways to define these models, based on the available data, the nature of the data, customer requirements and so on, but the logistic + multiple regression approach should be a sure bet to get you going.
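To make the two-step idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data, the single feature `past_spend`, and the helper names `fit_logistic`, `fit_linear` and `predict_payment`. It also assumes the regressor is trained only on customers who actually paid, which is one possible design choice rather than something stated above. In practice you would use several features and a library implementation instead of these hand-written fits.

```python
import math

# Hypothetical toy data, not from the question: (past_spend, paid, amount).
# past_spend is an assumed, already scaled feature; amount is 0 for non-payers.
customers = [
    (0.5, 0, 0.0), (1.0, 0, 0.0), (1.5, 0, 0.0),
    (2.5, 1, 50.0), (3.0, 1, 60.0), (3.5, 1, 70.0), (4.0, 1, 80.0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Binary logistic regression on one feature, by batch gradient descent."""
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        b -= lr * sum(errors) / n
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
    return b, w

def fit_linear(xs, ys):
    """Ordinary least squares on one feature (closed form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Stage 1: classifier trained on every customer (paid vs. not paid).
clf_b, clf_w = fit_logistic([c[0] for c in customers], [c[1] for c in customers])

# Stage 2: regressor trained only on the customers who actually paid.
payers = [c for c in customers if c[1] == 1]
reg_b, reg_w = fit_linear([c[0] for c in payers], [c[2] for c in payers])

def predict_payment(past_spend):
    """Return the expected amount, or 0.0 if the classifier predicts no payment."""
    if sigmoid(clf_b + clf_w * past_spend) < 0.5:
        return 0.0
    return reg_b + reg_w * past_spend
```

Calling `predict_payment` on a new customer first asks the classifier for a pay / no-pay decision and only then consults the regressor, so the second model never has to explain the zeros of non-paying customers.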

edited Jan 21 '22 at 03:00

answered Nov 14 '20 at 14:59


There is one point I am confused about: I have the "id" of the customers. Before doing logistic regression, I added a column named 'label' to my 'train' set, and this column gets the value "1" if the customer has paid, otherwise "0". Where I am stuck is this: after applying "logistic regression" to the test data, should I take the customer "id" information from there and find out how much money these customers will pay? – snnmst Nov 14 '20 at 15:55
My take is that the customer ID should *not* be an independent variable (input) of your linear regression model. The prediction should not be made based on who the customer is - this should be a "blind" decision purely based on tangible information you have for each customer (their buying behaviour/pattern). "Telling" the model who the customers are (i.e. using "id" as one of the inputs) would introduce unwanted bias. – Marc Nov 14 '20 at 16:02
First of all, thank you for your interest. I don't know how to proceed after applying "logistic regression" to my dataset. So far, the examples I have seen only apply a single model. Once I have applied "logistic regression", what should I do next? Thanks a lot. – snnmst Nov 14 '20 at 16:30
Let's say your models have 3 input variables for each customer: ```income_level```, ```historical_weekly_spending``` and ```age```. The 1st model (binary logistic regression) is trained with binary labels for the target variable ```will_buy```. Once trained, it can predict whether each completely new customer will or will not buy, based on his/her own features (income, spending, age). – Marc Nov 14 '20 at 16:48
Now, your second model (multiple linear regression) will use the same 3 independent variables as input (```income_level```, ```historical_weekly_spending``` and ```age```). The target variable will be ```expected_spend```. You train it with historical spending amounts (labels). Once trained, you can then predict (infer) how much spending to expect from any new customer based on the 3 given features. Hope it clarifies the overall approach a bit. – Marc Nov 14 '20 at 16:51

Nikaido · Answer 2 · 2020-11-14T19:10:11.003

Another approach would be to treat it as a pure regression problem, without a cascade of models, which would be simpler to handle.

For example, you could assign an amount spent of 0 to the people who are not willing to pay, and include these instances when fitting the model.

For the business, you could then apply a threshold: if the predicted amount falls below a more or less fixed value, you classify the user as "not willing to pay".
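A rough sketch of this single-model variant, in plain Python, might look like the following. The toy data, the feature `past_spend`, the cut-off `THRESHOLD = 25.0` and the helper names are all assumptions made up for illustration; a real threshold would have to come from the business.

```python
# Hypothetical toy data, not from the answer: (past_spend, amount_paid).
# Customers who did not pay stay in the training set with amount 0.
history = [
    (0.5, 0.0), (1.0, 0.0), (1.5, 0.0),
    (2.5, 50.0), (3.0, 60.0), (3.5, 70.0), (4.0, 80.0),
]

def fit_linear(xs, ys):
    """Ordinary least squares on one feature (closed form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# One regression fitted on payers and non-payers together.
intercept, slope = fit_linear([h[0] for h in history], [h[1] for h in history])

THRESHOLD = 25.0  # assumed business cut-off, purely illustrative

def predict(past_spend):
    """Return (label, predicted_amount) from the single regression model."""
    amount = intercept + slope * past_spend
    label = "willing to pay" if amount >= THRESHOLD else "not willing to pay"
    return label, amount
```

On this toy data the fitted line predicts a negative amount for the lowest-spending customer, which is one reason the threshold step is needed: a plain linear fit does not know that payments cannot go below zero.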

score 1 · Answer 3 · answered Nov 15 '20 at 09:52

Of course you can do it by vertically stacking models. Assuming that you are using binary classification, after prediction you will have a dataframe with target values 0 and 1. Filter the rows where target==1 into a new dataframe, then run the regression on it.
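The filter-then-regress step could be sketched as below, using plain Python lists of dicts in place of a dataframe. The rows, the `will_pay` column (assumed to be the 0/1 output of an already trained classifier), the feature `past_spend` and the helper `fit_line` are all hypothetical. Note that `id` is only carried along to label the results; it is never used as a model input.

```python
# Hypothetical test-set rows after the classification step.
# "will_pay" is assumed to be the 0/1 prediction of an already trained classifier.
test_rows = [
    {"id": 101, "past_spend": 0.5, "will_pay": 0},
    {"id": 102, "past_spend": 3.0, "will_pay": 1},
    {"id": 103, "past_spend": 1.0, "will_pay": 0},
    {"id": 104, "past_spend": 4.0, "will_pay": 1},
]

# Hypothetical training data for the regressor: historical payers only,
# as (past_spend, amount_paid) pairs.
train_payers = [(2.5, 50.0), (3.0, 60.0), (3.5, 70.0), (4.0, 80.0)]

def fit_line(points):
    """Ordinary least squares on one feature; returns (intercept, slope)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

intercept, slope = fit_line(train_payers)

# Equivalent of filtering the dataframe where target == 1.
predicted_payers = [row for row in test_rows if row["will_pay"] == 1]

# Run the regression only on the filtered rows, keyed by customer id.
expected_amount = {
    row["id"]: intercept + slope * row["past_spend"]
    for row in predicted_payers
}
```

The resulting `expected_amount` dictionary maps each predicted payer's id to an estimated amount, so the predictions can be joined back to the customer records afterwards.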

Also, if you don't have labels, you can use clustering instead of classification, since it avoids the cost of labelling the data.
