I have a dataset of ~10,000 rows of vehicles sold on a portal similar to Craigslist. The columns include price, mileage, no. of previous owners, how soon the car gets sold (in days), and most importantly a body of text that describes the vehicle (e.g. "accident free, serviced regularly").

I would like to find out which keywords, when included, will result in the car getting sold sooner. However I understand how soon a car gets sold also depends on the other factors especially price and mileage.

Fitting a regression on `TfidfVectorizer` features in scikit-learn gave very poor prediction accuracy. I am not sure whether I should also include price, mileage, etc. in the regression model, as it seems pretty complicated. I am currently considering repeating the TF-IDF regression on a particular segment of the data that is sufficiently large (perhaps Toyotas priced at $10k-$20k).

The last resort is to plot two histograms, one of vehicle listings containing a specific word/phrase and another for those that do not. The limitation here would be that the words that I choose to plot will be based on my subjective opinion.

Are there other ways to find out which keywords could potentially be important? Thanks in advance.

randomwerg
  • Which classifier did you use in `sklearn`? Most linear classifiers have a `coef_` attribute that tells you something about feature informativeness https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers but do note that if the classifier isn't linear it gets tricky, e.g. https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45 – alvas Jan 09 '19 at 02:55
  • I am using a linear regressor. You've raised a good point - the relationship might not be linear. Will try out a polynomial regressor and see if the results improve. – randomwerg Jan 09 '19 at 09:06

1 Answer

As you mentioned, you can only do so much with the body of text; the poor accuracy is itself an indication of how little the text influences how quickly a car sells.

Even though the model gives very poor prediction accuracy, you could still go ahead and look at the feature importance to understand which words drive the sales.

Include phrases in your TF-IDF vectorizer by setting the `ngram_range` parameter to `(1, 2)`. This might give you a small indication of which phrases influence the sale of a car.

I would also suggest setting the `norm` parameter of the TF-IDF vectorizer to `None` to check whether it has any influence. By default, it applies the l2 norm.

The results will also depend on the model you are using, so try changing the model as a last option.

Venkatachalam
  • Thank you for the suggestions! The original method (default `ngram_range` and `norm` values) gave an MSE of 165.15. This is huge given that most vehicles are sold within 10 days. Configuring `ngram_range` and `norm` returned an MSE on the order of 1e22, but rerunning `train_test_split` a couple of times gave figures in the 155 range. I suppose there is no meaningful relationship between the text and the speed of the sale. Reran the regressor on a subset of the data (Japanese cars). Default params gave 171.45; with `ngram_range` and `norm` it was 166.57. Will give other models a shot now. – randomwerg Jan 09 '19 at 06:16
  • What model are you using? Take a look at the feature importance. – Venkatachalam Jan 09 '19 at 06:29
  • I am using a `LinearRegression` from scikit-learn. Aside from polynomial and decision tree would there be any other model worth trying? Also, pardon me if my question is too simple - what do you mean by feature importance? Thanks! – randomwerg Jan 09 '19 at 09:02
  • `svm.SVC` could be worth trying. Feature importance in the case of `LinearRegression` is just the coefficients that the model learns from the data. `model.coef_` gives you those; match them against the feature names of the vectorizer. – Venkatachalam Jan 09 '19 at 09:07
  • Tried polynomial regression, decision tree regression, SVC and SVR, all of which gave higher MSEs than linear regression. I have looked at the coefficients of the trained model and there are some words/2-word phrases that have significant coefficients (+/- 2.5 to 4.0). Will now be simply analysing the mean/median/variance of sales with vs sales without specific keywords. Question on regression (SVR) vs classification (SVC) - does it matter which method we use here? With the no. of `y` classes around 75 (longest sale taking ~75 days), `y` has too many classes yet is discrete/non-continuous. – randomwerg Jan 09 '19 at 15:08
  • Sorry, I should have suggested `svm.SVR()`. Since you are trying to predict the number of days until the sale, it's a regression problem for sure. I was previously thinking about it as the probability of a sale. – Venkatachalam Jan 09 '19 at 15:57
  • Have accepted the answer and upvoted. My upvote does not show publicly due to my low rep score. Appreciate the help AI_Learning, cheers! – randomwerg Jan 09 '19 at 16:16
  • Thanks randomwalker. – Venkatachalam Jan 09 '19 at 16:20