How to create an automated range for dummy in R?

Question

I have the following DF and I want to create a dummy variable with an automatically computed scale that represents, as categories, whether a city has few, a medium number of, or a lot of companies.

cities	sum of companies
CITY A	199
CITY B	358
CITY C	250
CITY D	1265
CITY E	610

I tried the following code:

#install.packages("scales")
library(scales)

COMP_SCALES <- breaks_extended() # from the scales package
COMP_A <- COMP_SCALES(df[2], n = 4)
COMP_A <- cut(df[2],
              breaks = c(-Inf, COMP_A[2], COMP_A[3], COMP_A[4], Inf),
              labels = c("LITTLE", "MEDIUM", "A LOT OF", "+ A LOT OF"))

However, the automatically calculated scale is not very suitable, since all the cities end up in the "little" range. How can I better automate this code?

The final purpose is to create a table to better visualize the result, with something like this:

COMP_A_CLUSTER <- as.data.frame.matrix(table(COMP_A,kmeans.k$cluster))

Expected outcome: City A should be placed in "Little". City B and City C should be placed in "Medium". City E should be placed in "A lot of". City D should be placed in "+ a lot of".

I have a list of more than 10,000 cities and more than 100 columns that need a similar process, which is why I want the scale of the dummies to be calculated automatically.

score 1 · Answer 1 · answered Dec 16 '20 at 15:13

You can use quantile to pick intervals with equal numbers of samples in each. By default, quantile splits the data into 4 intervals (probs = seq(0, 1, 0.25)), but you can request different intervals through the probs argument.

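 # include.lowest = TRUE keeps the minimum value in the first interval;
 # without it, the smallest observation would become NA.
 COMP_A <- cut(df[,2],
               breaks=quantile(df[,2]),
               labels=c("LITTLE","MEDIUM","A LOT OF","+ A LOT OF"),
               include.lowest=TRUE)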
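
Since the question mentions more than 100 columns, the quantile approach above could be wrapped in a small helper and applied to every numeric column. The following is only a sketch: the helper name quantile_bins, the column name sum_of_companies, and the "_cat" suffix are assumptions made for illustration, and the data is retyped from the question. It also assumes the quantile breaks are distinct, because cut() stops with an error when they are not.

```r
# Hypothetical helper: bin a numeric vector into quantile-based categories.
# Assumes the quantile breaks are distinct; cut() errors otherwise.
quantile_bins <- function(x,
                          probs = seq(0, 1, 0.25),
                          labels = c("LITTLE", "MEDIUM", "A LOT OF", "+ A LOT OF")) {
  breaks <- quantile(x, probs = probs, na.rm = TRUE)
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Example data retyped from the question (column names are assumptions).
df <- data.frame(
  cities = c("CITY A", "CITY B", "CITY C", "CITY D", "CITY E"),
  sum_of_companies = c(199, 358, 250, 1265, 610)
)

# Apply the helper to every numeric column and store the results
# as new columns with a "_cat" suffix.
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
df[paste0(num_cols, "_cat")] <- lapply(df[num_cols], quantile_bins)
```

For a different number of categories, pass matching arguments, for example probs = seq(0, 1, length.out = 4) together with three labels.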

score 1 · Accepted Answer · answered Dec 16 '20 at 15:43

You can write your own function if you know the end (right) boundaries of each category. Below is a simple example. DF gets a new column 'CatCities' that holds what you are seeking.

It makes the following assumptions:

The lowest value of sum.of.companies is greater than or equal to 0
The highest value of sum.of.companies is 10000 (you can change it)
'CategoryList' in the function arguments is ordered from the lowest to the highest category, and 'EndPoints' is strictly increasing
The vectors passed as 'CategoryList' and 'EndPoints' have equal length in the function call

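DF <- read.csv("./SomeDF.csv")

# Return the first category whose end point is greater than or equal to x.
ClassifyRange <- function(x, CategoryList=c("Little","Medium","a lot of","+a lot of"),EndPoints=c(250,500,1000,10000)){
  Index <- which((EndPoints - x) >= 0)
  return(CategoryList[Index[1]])
}

# vapply returns a character vector; lapply would create a list column instead.
DF$CatCities <- vapply(DF$sum.of.companies, FUN=ClassifyRange, FUN.VALUE=character(1))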

It produces the following output
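
For reference, here is a minimal sketch that applies the same end points to the five cities from the question, using a vectorized cut() call rather than the function above. The data frame is retyped from the question, and the column name CatCitiesLeft is an assumption introduced only for this illustration.

```r
# Data retyped from the question; sum.of.companies follows the answer's naming.
DF <- data.frame(
  cities = c("CITY A", "CITY B", "CITY C", "CITY D", "CITY E"),
  sum.of.companies = c(199, 358, 250, 1265, 610)
)

CategoryList <- c("Little", "Medium", "a lot of", "+a lot of")
EndPoints <- c(250, 500, 1000, 10000)

# Right-closed intervals (-Inf, 250], (250, 500], ... mirror ClassifyRange,
# so a value of exactly 250 is labelled "Little".
DF$CatCities <- cut(DF$sum.of.companies,
                    breaks = c(-Inf, EndPoints),
                    labels = CategoryList)

# Left-closed variant (hypothetical column): [250, 500) is "Medium", which
# matches the expected outcome in the question. A value of exactly 10000
# would become NA with this variant.
DF$CatCitiesLeft <- cut(DF$sum.of.companies,
                        breaks = c(-Inf, EndPoints),
                        labels = CategoryList,
                        right = FALSE)

print(DF)
```

With right-closed intervals, City C (250) falls in "Little"; the left-closed variant places it in "Medium", so the choice of the right argument decides how boundary values are classified.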
