I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
La Jolla
2
Carlsbad
5
Esconsido
1
..etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.