I am struggling to wrap my head around a problem I need to resolve.
Say we have a cars dataset (1) containing many different cars with various features (id, age, mileage, color, model, ...). We also have a second dataset (2) of target cars with the same features. The only difference is that dataset1 has an additional column called comp_id, which links each comparable car in dataset1 to a target car in dataset2. So basically there are 5 cars in dataset1 that are similar to each car in dataset2.
- Dataset1 would have 1000 datapoints (comparable cars)
- Dataset2 would have 200 datapoints (target cars)
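To make the setup concrete, here is a tiny sketch of how I picture the two datasets. The rows and values are made up for illustration, and I am assuming that comp_id in dataset1 holds the id of the target car in dataset2.

```python
from collections import defaultdict

# Hypothetical toy rows; the real data would have 200 targets and 1000 comparables.
targets = [
    {"id": "t1", "age": 3, "mileage": 40000, "color": "red", "model": "A"},
    {"id": "t2", "age": 8, "mileage": 120000, "color": "blue", "model": "B"},
]
comparables = [
    {"id": "c1", "comp_id": "t1", "age": 3, "mileage": 42000, "color": "red", "model": "A"},
    {"id": "c2", "comp_id": "t1", "age": 4, "mileage": 39000, "color": "black", "model": "A"},
    {"id": "c3", "comp_id": "t2", "age": 7, "mileage": 115000, "color": "blue", "model": "B"},
]

# Group comparable ids by the target they are linked to (assumed meaning of comp_id).
comps_by_target = defaultdict(list)
for row in comparables:
    comps_by_target[row["comp_id"]].append(row["id"])
```

With the real data, each target id would map to a list of 5 comparable ids.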
I am very confused about how to approach this problem: to begin with, I don't even know whether I should use a supervised or an unsupervised approach. Also, how can I determine which features are relevant for choosing a car as a comparable?
Without getting into too complicated stuff, my first thoughts were:
- Supervised
  - logistic regression: create a binary variable "selected" indicating whether a car was selected as a comparable or not, and treat each group of one target and its 5 comparables as a training set. It would be as if I were doing cross-validation with 200 folds, each with a different target-comparables group.
- Unsupervised
  - create a similarity score (for example, cosine similarity or Euclidean distance) between each candidate car and a given target, rank the candidates, and take the top 5.
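For the unsupervised idea, this is roughly what I have in mind. It is only a minimal sketch under my own assumptions: it uses just two numeric features (age, mileage), z-scores them so mileage does not dominate the distance, and ranks by Euclidean distance. The helper names and the toy rows are hypothetical, and categorical features like color or model would need some encoding that I have not worked out.

```python
import math
from statistics import mean, pstdev

NUMERIC = ["age", "mileage"]  # assumed numeric features; categoricals are ignored here

def standardizer(rows, features):
    """Return a function that maps a row to a z-scored feature vector."""
    stats = {}
    for f in features:
        values = [r[f] for r in rows]
        stats[f] = (mean(values), pstdev(values) or 1.0)  # avoid dividing by zero spread
    return lambda r: [(r[f] - stats[f][0]) / stats[f][1] for f in features]

def top_k_comparables(target, candidates, k=5, features=NUMERIC):
    """Rank candidates by Euclidean distance to the target in z-scored space."""
    to_vec = standardizer(candidates + [target], features)
    t = to_vec(target)
    ranked = sorted(candidates, key=lambda c: math.dist(t, to_vec(c)))
    return [c["id"] for c in ranked[:k]]

# Hypothetical toy data: comp_id is assumed to hold the linked target's id.
target = {"id": "t1", "age": 3, "mileage": 40000}
candidates = [
    {"id": "c1", "comp_id": "t1", "age": 3, "mileage": 42000},
    {"id": "c2", "comp_id": "t1", "age": 4, "mileage": 39000},
    {"id": "c3", "comp_id": "t2", "age": 7, "mileage": 115000},
    {"id": "c4", "comp_id": "t2", "age": 9, "mileage": 130000},
]

predicted = top_k_comparables(target, candidates, k=2)

# Compare the ranking against the known links to see how well the score recovers them.
actual = {c["id"] for c in candidates if c["comp_id"] == target["id"]}
hit_rate = len(actual & set(predicted)) / len(actual)
```

Since comp_id already tells me which cars were chosen, I could presumably use a hit rate like this to compare different feature sets or distance measures, even though the scoring itself is unsupervised.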
I would love to pick your brains too and hear what you guys think.
Thank you so much in advance!