I have a dataset with more than 100 columns, all of type factor. For example:

          animal               fruit               vehicle              color 
             cat              orange                   car               blue 
             dog               apple                   bus              green 
             dog               apple                   car              green 
             dog              orange                   bus              green

In my dataset I need to remove every column in which any factor level has fewer than 5 observations. In this example, if I want to remove all columns that contain a level with 1 or fewer observations, such as blue or cat, the algorithm should remove the columns animal and color. What is the most elegant way to do this?

2 Answers

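We can use Filter with table. Filter keeps only the columns for which the function returns TRUE, here the columns in which no level has fewer than 2 observations: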

Filter(function(x) !any(table(x) < 2), df1)
#  fruit vehicle
#1 orange     car
#2  apple     bus
#3  apple     car
#4 orange     bus

data

df1 <- structure(list(animal = structure(c(1L, 2L, 2L, 2L), .Label = c("cat", 
"dog"), class = "factor"), fruit = structure(c(2L, 1L, 1L, 2L
), .Label = c("apple", "orange"), class = "factor"), vehicle = structure(c(2L, 
1L, 2L, 1L), .Label = c("bus", "car"), class = "factor"), color = structure(c(1L, 
2L, 2L, 2L), .Label = c("blue", "green"), class = "factor")),
row.names = c(NA, 
-4L), class = "data.frame")
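
The question's actual threshold is 5 observations per level, so the same idea can be wrapped in a small helper with the threshold as a parameter. This is only a minimal sketch: the function name drop_sparse_factors and its min_n argument are hypothetical and not part of the original answer. It calls droplevels() first because table() on a factor also counts unused levels (as 0), which would otherwise cause a column to be dropped for a level that never occurs.

```r
# Rebuild the example data so the snippet runs on its own
df1 <- data.frame(
  animal  = factor(c("cat", "dog", "dog", "dog")),
  fruit   = factor(c("orange", "apple", "apple", "orange")),
  vehicle = factor(c("car", "bus", "car", "bus")),
  color   = factor(c("blue", "green", "green", "green"))
)

# Hypothetical helper: keep only the columns in which every observed
# level occurs at least `min_n` times
drop_sparse_factors <- function(df, min_n = 5) {
  keep <- vapply(
    df,
    function(x) all(table(droplevels(x)) >= min_n),
    logical(1)
  )
  # drop = FALSE keeps the result a data frame even if 0 or 1 columns remain
  df[, keep, drop = FALSE]
}

names(drop_sparse_factors(df1, min_n = 2))
# "fruit" "vehicle"
```

With the default min_n = 5, every column of this four-row example would be dropped, since no level can reach 5 observations.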
akrun

We can use select_if from dplyr, which keeps the columns for which the predicate returns TRUE:

library(dplyr)
df1 %>% select_if(~all(table(.) > 1))

#   fruit vehicle
#1 orange     car
#2  apple     bus
#3  apple     car
#4 orange     bus
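
In dplyr 1.0.0 and later, select_if is superseded, and the same predicate can be passed to select() through where(). The following is a sketch under that assumption (dplyr >= 1.0.0 installed); the variable name min_n is introduced here for illustration and does not come from the original answer.

```r
library(dplyr)

# Rebuild the example data so the snippet runs on its own
df1 <- data.frame(
  animal  = factor(c("cat", "dog", "dog", "dog")),
  fruit   = factor(c("orange", "apple", "apple", "orange")),
  vehicle = factor(c("car", "bus", "car", "bus")),
  color   = factor(c("blue", "green", "green", "green"))
)

# Illustrative threshold: each level must occur at least this many times
min_n <- 2

# where() applies the predicate to each column and selects those returning TRUE
res <- df1 %>%
  select(where(function(x) all(table(x) >= min_n)))

names(res)
# "fruit" "vehicle"
```

Setting min_n to 5 would match the threshold stated in the question.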
Ronak Shah