Can different summary metrics of a single feature be used as features for k-means clustering?

Question

I have a scenario where I want to understand customers' behavior patterns and group them into different segments/clusters for an e-commerce platform. I chose an unsupervised machine learning algorithm, k-means clustering, to accomplish this task.

I have purchase_orders data available to me.

While preparing my data set, I had a question: can different summary metrics (Sum, Avg, Min, Max, Standard Deviation) of a single feature be used as separate features? Or should I take only one summary metric of a feature (say, the sum of a customer's transaction amounts over multiple orders)?

Will this affect how the k-means algorithm works?

Which of the two data formats below would be the better input to the algorithm for getting good results?

Format-1:

Customer ID | Total.TransactionAmount | Min.TransactionAmount | Max.TransactionAmount | Avg.TransactionAmount | StdDev.TransactionAmount | TotalNo.ofTransactions and so on...,

Format-2:

Customer ID | Total.TransactionAmount | TotalNo.ofTransactions and so on...,

(Note: Consider "|" as feature separator) (Note: Customer ID is not fed as input to the algo)

I'm voting to close this question as off-topic because it is not about programming as defined in the guidelines — desertnaut, Sep 12 '19 at 21:12
@desertnaut: Thank you. let me know which is the right platform to ask such questions..?? Not sure what made you think this is off-topic and non programming, as this is related how the algorithm processes the input data fed into it. I remember seeing questions asked in same pattern, answered and voted up previously on this platform..!! — Rajiv2806, Sep 13 '19 at 07:32
SO is about *specific coding* questions, and not a design service or discussion forum; please do take some time to read [How to Ask](https://stackoverflow.com/help/how-to-ask) and [What topics can I ask about here?](https://stackoverflow.com/help/on-topic). — desertnaut, Sep 13 '19 at 08:57
Can you list (if possible) the features that you have available in this purchase_orders that you have? But I feel the inclusion of `Sum, Avg, Min, Max, Standard Deviation` will impact your analysis, and not in a good way. These can be used to get extra information, but not for grouping customers. — a_r, Sep 13 '19 at 10:13

score 0 · Answer 1 · answered Sep 13 '19 at 10:04

Yes, you can, but whether this is a good idea is anything but clear.

These values will be correlated, and this will distort the results. It will likely worsen the problems you already have: the values are not linear, are not of equal importance (and hence need weighting), and are not of similar magnitude.
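To see the correlation concretely, here is a minimal sketch using only the Python standard library. The customer IDs and order amounts are made up for illustration, and `summarize` and `pearson` are hypothetical helper names, not part of any library.

```python
from statistics import mean

# Hypothetical order amounts per customer (made-up numbers for illustration).
orders = {
    "c1": [10.0, 12.0, 11.0],
    "c2": [100.0, 150.0, 120.0, 130.0],
    "c3": [20.0, 25.0],
    "c4": [500.0, 450.0, 600.0, 550.0, 520.0],
    "c5": [60.0, 80.0, 70.0],
}

def summarize(amounts):
    """Collapse one customer's orders into several summary metrics."""
    return {
        "total": sum(amounts),
        "min": min(amounts),
        "max": max(amounts),
        "avg": mean(amounts),
    }

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

features = {cid: summarize(amounts) for cid, amounts in orders.items()}
names = ["total", "min", "max", "avg"]
columns = {n: [features[cid][n] for cid in orders] for n in names}

# Print the correlation for every pair of summary metrics.
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a:>5} vs {b:<5}: {pearson(columns[a], columns[b]):.3f}")
```

On data like this, the summary columns mostly restate the same underlying quantity (how much a customer spends per order), so feeding all of them to k-means counts that quantity several times.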

With features such as "transaction amount" and "number of transactions" you already have some pretty bad scaling issues to solve, so why add more?
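The scaling issue can be illustrated with a small sketch, again with made-up numbers. It measures how much of the squared Euclidean distance between two customers comes from each feature, before and after z-score standardization. Standardization is only one possible choice of scaling, assumed here for illustration; `standardize` and `distance_shares` are hypothetical helper names.

```python
from statistics import mean, pstdev

# Hypothetical customers: (total transaction amount, number of transactions).
customers = [(1200.0, 3), (300.0, 12), (5000.0, 5), (800.0, 20)]

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    cols = list(zip(*rows))
    means = [mean(col) for col in cols]
    stds = [pstdev(col) for col in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

def distance_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squared)
    return [s / total for s in squared]

raw_shares = distance_shares(customers[0], customers[1])
scaled = standardize(customers)
scaled_shares = distance_shares(scaled[0], scaled[1])

print("raw:   ", [round(s, 4) for s in raw_shares])
print("scaled:", [round(s, 4) for s in scaled_shares])
```

On the raw values, almost the entire distance comes from the amount column, so the number of transactions is effectively ignored. After standardization both features contribute, but whether equal weighting is what you want is still a modelling decision.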

It's straightforward to write down your objective function. Put your features into the equation, and try to understand what you are optimizing - is this really what you need? Or do you just want some random result?
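For reference, the standard k-means objective (within-cluster sum of squares) can be written as follows, where the notation is my own: $C_j$ is the $j$-th cluster, $\mu_j$ its centroid, and $D$ the number of features.

```latex
J \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
  \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \sum_{d=1}^{D} \left( x_{id} - \mu_{jd} \right)^2
```

Every feature $d$ enters the sum with the same weight. If several columns are near-duplicates of one quantity (total, average, min, and max of the transaction amount, say), that quantity is counted several times in $J$, and a column with a large numeric range dominates the sum.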
