-1

I am struggling to wrap my head around a problem I need to resolve.

Say that we have a cars dataset (1) with many different cars that have different features (id, age, mileage, color, model,...). On the other hand, we have another dataset (2) with target cars that have the same features. Only difference is, that dataset1 has an additional column called comp_id. This column links cars from dataset2 with comparable cars from dataset1. So basically there are 5 cars in dataset1 that are similar to 1 car in dataset2.

  • Dataset1 would have 1000 datapoints (comparable cars)
  • Dataset2 would have 200 datapoints (target cars)

I am very confused how to address this problem: to begin with, I don't even know if I shall do a supervised or an unsupervised approach. Also, how can I determine which features are relevant to be chosen as comparable car?

Without getting into too complicated stuff, my first thoughts were:

  1. Supervised
  • logistic regression: create a variable "selected" with a binary outcome depending if selected or not, and treat each target (1) - comparables (5) pair as a training set. So as if I were doing a cross validation with 200 folds, each with a different target - comparable pair.
  1. Unsupervised
  • create a similarity score (cosine similarity or euclidean distance for example) for each comparable car (compared to one target), rank and take the top 5.

I would love to pick your brains too and hear what you guys think.

Thank you so much in advance!

  • You're trying to predict the price? The second dataset have the pricing while the first one doesn't? You can add a column of price to the first dataset, with the price of the similar car from the second dataset, and then treat it as your dataset. Then do logistic regression on it. – sagi Feb 20 '23 at 15:24
  • Hi @sagi! Thank you very much for your quick response. No, price is not present nor is it important. The important thing is to "select" 5 cars that compare in terms of characteristics/features. – naseriani Feb 20 '23 at 15:29

1 Answers1

0

To me this sounds more like an unsupervised problem rather than a supervised one. Besides that, with just 1200 total datapoints, I would rather not get into k-fold to get good accuracy metrics.

If you eventually choose to model it with a distance metric, just be aware to model the categoric variables (color, model...) accordingly.

rikyeah
  • 1,896
  • 4
  • 11
  • 21