I have a dataset of ~10,000 rows of vehicles sold on a portal similar to Craigslist. The columns include price, mileage, no. of previous owners, how soon the car gets sold (in days), and most importantly a body of text that describes the vehicle (e.g. "accident free, serviced regularly").
I would like to find out which keywords, when included, will result in the car getting sold sooner. However I understand how soon a car gets sold also depends on the other factors especially price and mileage.
Running a TfidfVectorizer in scikit-learn resulted in very poor prediction accuracy. Not sure if I should try including price, mileage, etc. in the regression model as well, as it seems pretty complicated. Currently am considering repeating the TF-IDF regression on a particular segment of the data that is sufficiently huge (perhaps Toyotas priced at $10k-$20k).
The last resort is to plot two histograms, one of vehicle listings containing a specific word/phrase and another for those that do not. The limitation here would be that the words that I choose to plot will be based on my subjective opinion.
Are there other ways to find out which keywords could potentially be important? Thanks in advance.