I have a scenario where I want to understand customers' behavior patterns and group them into different segments/clusters for an e-commerce platform. I chose an unsupervised machine learning algorithm, k-means clustering, to accomplish this task.
I have purchase_orders data available to me.
While preparing my data set, I had a question: can different summary metrics (Sum, Avg, Min, Max, Standard Deviation)
of a single feature be used as separate features? Or should I take only one summary metric of a feature (say, the sum of a customer's transaction amounts over multiple orders)?
Will this affect how the k-means algorithm works?
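To make "several summary metrics of one feature" concrete, here is a minimal sketch of how such metrics might be derived with pandas. The column names `customer_id` and `transaction_amount`, the sample values, and the helper `summarize_customers` are hypothetical, since the actual layout of the purchase_orders data is not shown here.

```python
import pandas as pd

# Hypothetical purchase_orders sample; the real column names may differ.
purchase_orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "transaction_amount": [20.0, 35.0, 50.0, 200.0, 180.0, 15.0],
})


def summarize_customers(orders: pd.DataFrame) -> pd.DataFrame:
    """Return one row per customer with several summaries of one raw feature."""
    summary = orders.groupby("customer_id")["transaction_amount"].agg(
        total="sum",
        minimum="min",
        maximum="max",
        average="mean",
        std_dev="std",
        n_transactions="count",
    )
    # The sample standard deviation is undefined (NaN) for a customer with a
    # single order; treating it as 0 is an assumption, not the only option.
    summary["std_dev"] = summary["std_dev"].fillna(0.0)
    return summary


customer_features = summarize_customers(purchase_orders)
print(customer_features)
```

Each resulting column is a candidate feature. Several of them (for example total, average, and maximum) are derived from the same raw values, so they are likely to be correlated, which is the core of the question.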
Which of the two data formats below would be the better input to my algorithm for deriving good results:
Format-1:
Customer ID | Total.TransactionAmount | Min.TransactionAmount | Max.TransactionAmount | Avg.TransactionAmount | StdDev.TransactionAmount | TotalNo.ofTransactions and so on...
Format-2:
Customer ID | Total.TransactionAmount | TotalNo.ofTransactions and so on...
(Note: Consider "|" as the feature separator.) (Note: Customer ID is not fed as input to the algorithm.)
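One way to compare the two formats empirically is sketched below, assuming pandas and scikit-learn. The data is synthetic, the column names and the helper `cluster_customers` are hypothetical, and the choice of 3 clusters is arbitrary. Because k-means relies on Euclidean distances, the features are standardized first; that step is my assumption and not part of either format.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic orders standing in for the real purchase_orders data.
rng = np.random.default_rng(0)
orders = pd.DataFrame({
    "customer_id": rng.integers(0, 60, size=600),
    "transaction_amount": rng.gamma(shape=2.0, scale=40.0, size=600),
})

# Format-1: several summary metrics of the same raw feature.
format_1 = (
    orders.groupby("customer_id")["transaction_amount"]
    .agg(total="sum", minimum="min", maximum="max",
         average="mean", std_dev="std", n_transactions="count")
    .fillna(0.0)  # std is NaN for single-order customers (assumed 0 here)
)

# Format-2: only the total amount and the number of transactions.
format_2 = format_1[["total", "n_transactions"]]


def cluster_customers(features: pd.DataFrame, k: int = 3) -> np.ndarray:
    """Standardize the features, then assign each customer to one of k clusters."""
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    return model.fit_predict(scaled)


labels_1 = cluster_customers(format_1)
labels_2 = cluster_customers(format_2)

# Cross-tabulate the two assignments to see how much the segments differ.
print(pd.crosstab(labels_1, labels_2,
                  rownames=["format_1"], colnames=["format_2"]))
```

If the cross-tabulation is close to a permutation of a diagonal matrix, the extra metrics change little; if customers are spread across cells, the choice of format materially changes the segments.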