
I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.

For example:

table(df$sub_area)

 La Jolla  Carlsbad Escondido
        2         5         1
..etc

I want to filter out the sub-areas that appear only once, since they don't offer much predictive power but add a lot of computation time. However, I want to replace the sub_area entry for those properties with a blank or NA rather than drop the rows, since I still want to use the rest of the information for each property, such as bedrooms, bathrooms, etc.

For reference, an individual property entry might look like:

ID  Beds  Baths  City       Sub_area  sqm  ... etc
1   4     2      San Diego  La Jolla  100  ...

Then I can do

lm(price ~ beds + baths + city + sub_area, data = df)

using the new, smaller sub_area variable, which has fewer levels.

I want to do this because most of the predictive power for price is contained in sub_area for the locations I'm working on.

Macter
See if this is it: [Reducing number of factor levels before modelling](https://stackoverflow.com/questions/50504804/reducing-number-of-factor-levels-before-modelling). Or maybe this one: [Reduce number of levels for large categorical variables](https://stackoverflow.com/questions/39066382/reduce-number-of-levels-for-large-categorical-variables?rq=1). – Rui Barradas May 25 '18 at 14:50

3 Answers


One way:

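# Sub-areas with more than 10 listings (use > 1 to drop only the singletons)
areas <- names(which(table(df$sub_area) > 10))
# Blank out every other sub-area; the rest of each row is left untouched
df$sub_area[!df$sub_area %in% areas] <- NA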
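
One thing to watch if sub_area is a factor: setting entries to NA does not remove the now-empty levels, so the factor still lists every original level until you drop them explicitly. A minimal sketch on made-up data (the toy data frame and the `min_count` threshold are assumptions for illustration, not taken from the question):

```r
# Toy data standing in for the real data frame (values are invented).
df <- data.frame(
  price    = c(500, 520, 300, 310, 305, 320, 315, 250),
  sub_area = factor(c("La Jolla", "La Jolla", "Carlsbad", "Carlsbad",
                      "Carlsbad", "Carlsbad", "Carlsbad", "Escondido"))
)

min_count <- 2  # hypothetical threshold: keep sub-areas with at least 2 rows

counts <- table(df$sub_area)
keep   <- names(counts)[counts >= min_count]

# Blank out the rare sub-areas; all other columns stay as they are.
df$sub_area[!df$sub_area %in% keep] <- NA

nlevels(df$sub_area)                    # 3: "Escondido" is still a level
df$sub_area <- droplevels(df$sub_area)  # remove levels with no rows left
nlevels(df$sub_area)                    # 2
```

Raising `min_count` trades location detail for fewer levels, so it is worth trying a few values.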

Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.

Then set sub_area to NA in the original dataframe wherever the subarea does not appear in the filtered sub_area_count.

library(dplyr)

sub_area_count <- df %>% 
  count(sub_area) %>% 
  filter(n > 1)

# TRUE for rows whose sub-area was filtered out (it occurs only once)
boo <- !df$sub_area %in% sub_area_count$sub_area
df$sub_area[boo] <- NA
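
The same idea can probably also be written as one pipeline, without the intermediate count table. This is only a sketch, assuming dplyr is installed; the toy data and the helper column name `n_sub_area` are made up for illustration:

```r
library(dplyr)

# Toy data standing in for the real data frame (values are invented).
df <- data.frame(
  price    = c(500, 520, 300, 310, 305, 320, 315, 250),
  sub_area = factor(c("La Jolla", "La Jolla", "Carlsbad", "Carlsbad",
                      "Carlsbad", "Carlsbad", "Carlsbad", "Escondido"))
)

df <- df %>%
  # Attach each row's sub-area frequency as a temporary column.
  add_count(sub_area, name = "n_sub_area") %>%
  # Replace sub-areas seen fewer than twice with NA; other columns are kept.
  mutate(sub_area = replace(sub_area, n_sub_area < 2, NA)) %>%
  select(-n_sub_area) %>%
  # Drop factor levels that no longer have any rows.
  droplevels()

table(df$sub_area, useNA = "ifany")
```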
D Pinto

You didn't give a reproducible example, but I think this will identify the sub-areas whose count is 1:

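# Frequency table as a data frame with columns Var1 (level) and Freq (count)
count_1 <- as.data.frame(table(df$sub_area))
# Keep only the levels that occur exactly once
count_1 <- count_1$Var1[count_1$Freq == 1]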
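
Once the single-count levels are known, there is a caveat worth checking before blanking them out: with R's default `na.action` (`na.omit`), `lm()` drops any row that has an NA in a model variable, so those properties would leave the regression entirely. A possible alternative is to pool the rare levels into one catch-all level. The sketch below compares both on made-up data; the toy values and the "Other" label are assumptions for illustration:

```r
# Toy data standing in for the real data frame (values are invented).
df <- data.frame(
  price    = c(500, 520, 300, 310, 305, 320, 250, 410, 315),
  beds     = c(4, 3, 2, 3, 3, 4, 2, 3, 2),
  sub_area = factor(c("La Jolla", "La Jolla", "Carlsbad", "Carlsbad",
                      "Carlsbad", "Carlsbad", "Escondido", "Del Mar",
                      "Carlsbad"))
)

# Levels that occur exactly once.
freq <- as.data.frame(table(df$sub_area))
rare <- as.character(freq$Var1[freq$Freq == 1])

# Option A: set rare sub-areas to NA. lm() then omits those rows.
df_na <- df
df_na$sub_area[df_na$sub_area %in% rare] <- NA
df_na$sub_area <- droplevels(df_na$sub_area)
fit_na <- lm(price ~ beds + sub_area, data = df_na)
nobs(fit_na)     # 7 of the 9 rows are used

# Option B: merge rare sub-areas into a single "Other" level. All rows stay.
df_other <- df
lv <- levels(df_other$sub_area)
lv[lv %in% rare] <- "Other"
levels(df_other$sub_area) <- lv  # duplicate names merge the levels
fit_other <- lm(price ~ beds + sub_area, data = df_other)
nobs(fit_other)  # all 9 rows are used
```

Which option is preferable depends on whether the rare sub-areas resemble each other enough to share one coefficient.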
cirofdo