
Here is the data I am working with:

library(RCurl)

x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/testmain.csv")
data <- read.csv(text = x)

I want to make dummy variables for the top, middle, and bottom third of the values in year_hh_inc. Each unique value in my id column reg_schl can have more than one value for year_hh_inc, so the dummy variables need to be computed within each reg_schl group; I want to be able to differentiate the year_hh_inc values within each unique reg_schl.
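To illustrate with a toy vector (made-up numbers, not my real data), the per-group tertile logic I am after looks like this in base R:

```r
# Toy vector standing in for one reg_schl group's year_hh_inc values
inc <- c(10, 20, 30, 40, 50, 60)

# Tertile break points (0%, 33%, 67%, 100% quantiles)
breaks <- quantile(inc, c(0, 1/3, 2/3, 1), na.rm = TRUE)

# Assign each value to low / middle / high
categories <- cut(inc, breaks, labels = c("low", "middle", "high"),
                  include.lowest = TRUE)
categories
```

This yields two values in each of the three bins; the real question is how to apply it per group.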

So far I have the following, which was posted below as the solution from Sotos:

library(dplyr)
library(tidyr)

data %>% 
  group_by(reg_schl) %>%
  mutate(category = cut(year_hh_inc,
                        breaks = quantile(year_hh_inc, c(0, 1/3, 2/3, 1), na.rm = TRUE),
                        labels = c("low", "middle", "high"),
                        include.lowest = TRUE),
         vals = 1) %>% 
  pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))

This is working well.

I have also used this solution provided by Allan:

cut_by_id <- function(x)
{
  x$category <- cut(x$year_hh_inc, quantile(x$year_hh_inc, c(0, 1/3, 2/3, 1), na.rm = TRUE), 
                    labels = c("low", "middle", "high"), include.lowest = TRUE)
  return(x)
}

data <- do.call(rbind, lapply(split(data, data$reg_schl), cut_by_id))
Collective Action

2 Answers


You could use the split-lapply-rbind paradigm:

cut_by_id <- function(x)
{
  x$category <- cut(x$inc, quantile(x$inc, c(0,1/3,2/3,1), na.rm = TRUE), 
                    labels = c("low","middle","high"), include.lowest = TRUE)
  return(x)
}

data <- do.call(rbind, lapply(split(data, data$id), cut_by_id))

data
#>      id   inc fee fert fee_per_inc category
#> 1.1   1 11000 125 0.15 0.011363636      low
#> 1.2   1 15000 150 0.12 0.010000000      low
#> 1.3   1 17000 175 0.22 0.010294118   middle
#> 1.4   1 19000 200 0.13 0.010526316     high
#> 1.5   1 21000 225 0.12 0.010714286     high
#> 2.6   2 13000  55 0.11 0.004230769      low
#> 2.7   2 16000  75 0.09 0.004687500      low
#> 2.8   2 19000  85 0.23 0.004473684   middle
#> 2.9   2 21000  95 0.05 0.004523810     high
#> 2.10  2 25000 105 0.01 0.004200000     high
#> 3.11  3 18000  75 0.25 0.004166667      low
#> 3.12  3 21000  85 0.03 0.004047619      low
#> 3.13  3 23000  95 0.05 0.004130435   middle
#> 3.14  3 27000 105 0.15 0.003888889     high
#> 3.15  3 30000 115 0.25 0.003833333     high

boxplot(inc ~ category, data = data, col = 3:5)

Created on 2020-02-26 by the reprex package (v0.3.0)

Allan Cameron
  • Thank you for your answer. I am implementing it now and I have a question about `x$category <- cut(x$fee, quantile(x$fee, c(0,1/3,2/3,1), na.rm = TRUE)`: I do not understand why `fee` is involved in the syntax. I want to cut by the low, middle, and high of the `inc` column. Is that what the function does? Or is it cutting by the `fee` column values? I am not fluent with R, and I might be missing the obvious here. – Collective Action Feb 26 '20 at 10:31
    @MichaelPerdue sorry - that should be `inc` – Allan Cameron Feb 26 '20 at 11:21
  • Is there a way to alter the code (in the above example) so that it is not grouped by `id`, i.e. so the boxplot shows the range of `inc` independent of `id`? – Collective Action Mar 03 '20 at 13:31
    @MichaelPerdue if you remove `labels=` it should label the groups with the numeric ranges. If you prefer, you can set your own breaks rather than using quantiles, and then label them according to those breaks. I would probably see what the quantiles were and pick nearby round numbers to use as breaks, then use these numbers as the labels. – Allan Cameron Mar 03 '20 at 13:49
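For reference, the ungrouped variant discussed in the comment above can be sketched like this. The toy data mirrors the answer's frame, and `category_all` is an assumed column name, not one from the original code:

```r
# Toy data standing in for the answer's frame (values are made up)
data <- data.frame(
  id  = rep(1:3, each = 2),
  inc = c(11000, 21000, 13000, 25000, 18000, 30000)
)

# One set of tertile breaks computed over ALL of inc, ignoring id
data$category_all <- cut(data$inc,
                         quantile(data$inc, c(0, 1/3, 2/3, 1), na.rm = TRUE),
                         labels = c("low", "middle", "high"),
                         include.lowest = TRUE)

# Boxplot by the global (not per-id) tertiles
boxplot(inc ~ category_all, data = data, col = 3:5)
```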

We can create your factor variable based on quantiles and spread those values, i.e.

library(dplyr)
library(tidyr)

data %>% 
  group_by(id) %>%
  mutate(category = cut(inc,
                        breaks = quantile(inc, c(0, 1/3, 2/3, 1), na.rm = TRUE),
                        labels = c("low", "middle", "high"),
                        include.lowest = TRUE),
         vals = 1) %>% 
  pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))

which gives,

# A tibble: 15 x 8
# Groups:   id [3]
      id   inc   fee  fert fee_per_inc   low middle  high
   <dbl> <dbl> <dbl> <dbl>       <dbl> <dbl>  <dbl> <dbl>
 1     1 11000   125  0.15     0.0114      1      0     0
 2     1 15000   150  0.12     0.01        1      0     0
 3     1 17000   175  0.22     0.0103      0      1     0
 4     1 19000   200  0.13     0.0105      0      0     1
 5     1 21000   225  0.12     0.0107      0      0     1
 6     2 13000    55  0.11     0.00423     1      0     0
 7     2 16000    75  0.09     0.00469     1      0     0
 8     2 19000    85  0.23     0.00447     0      1     0
 9     2 21000    95  0.05     0.00452     0      0     1
10     2 25000   105  0.01     0.0042      0      0     1
11     3 18000    75  0.25     0.00417     1      0     0
12     3 21000    85  0.03     0.00405     1      0     0
13     3 23000    95  0.05     0.00413     0      1     0
14     3 27000   105  0.15     0.00389     0      0     1
15     3 30000   115  0.25     0.00383     0      0     1

NOTE: I added the argument `include.lowest = TRUE` in `cut` in order to capture the lowest value in the first bin (`low`).
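A small demonstration of why that argument matters (toy values): by default `cut` uses intervals that are open on the left, so the minimum value falls outside every bin and gets `NA`:

```r
inc  <- c(11000, 15000, 17000, 19000, 21000)
brks <- quantile(inc, c(0, 1/3, 2/3, 1))

# Default: intervals are (a, b], so the minimum (11000) matches no bin
without <- cut(inc, brks, labels = c("low", "middle", "high"))

# include.lowest = TRUE closes the first interval, so 11000 lands in "low"
with_il <- cut(inc, brks, labels = c("low", "middle", "high"),
               include.lowest = TRUE)

is.na(without[1])        # TRUE
as.character(with_il[1]) # "low"
```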

Sotos
  • Thank you for your answer. Examining the table, I see that none of the values in `inc` for `id == 1` are marked in the `high` column. I am trying to categorize low, medium, and high income for each `id` value. – Collective Action Feb 26 '20 at 11:18
    Oh I missed the `group_by(id)` before the `mutate`...let me add it – Sotos Feb 26 '20 at 11:20
  • Is there a way to check whether all `inc` values are accounted for in one of the columns `low`, `middle`, or `high`? – Collective Action Feb 26 '20 at 13:38
  • I am not sure what you mean – Sotos Feb 26 '20 at 13:44
  • I applied the code to my main df and some of the values for `inc` are not 'categorized' in columns 'low', 'middle', 'high'. I think I will have to upload my dataframe to show what I am talking about. – Collective Action Feb 26 '20 at 13:47
    Yes I think it's better to isolate a sample with the problem and share that – Sotos Feb 26 '20 at 13:58
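One way to run the completeness check discussed above, as a sketch on toy data (`res` and `bad` are names I introduce here): after the pivot, every row should have exactly one of the three dummies set, so rows where the dummies do not sum to 1 (for example because `inc` was `NA` and `cut` returned `NA`) are the uncategorised ones:

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the real frame (values are made up)
data <- data.frame(
  id  = rep(1:2, each = 3),
  inc = c(11000, 15000, 17000, 13000, 16000, 19000)
)

# Same pipeline as in the answer
res <- data %>%
  group_by(id) %>%
  mutate(category = cut(inc,
                        breaks = quantile(inc, c(0, 1/3, 2/3, 1), na.rm = TRUE),
                        labels = c("low", "middle", "high"),
                        include.lowest = TRUE),
         vals = 1) %>%
  pivot_wider(names_from = category, values_from = vals,
              values_fill = list(vals = 0))

# Rows whose dummies do not sum to exactly 1 were never categorised
bad <- res %>% ungroup() %>% filter(low + middle + high != 1)
nrow(bad)  # 0 means every inc value received exactly one category
```

Note that a group with very few distinct `inc` values can also produce duplicate quantile breaks, which makes `cut` fail outright rather than silently leave rows uncategorised.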