scale/normalize columns by group

Question

I have a data frame that looks like this:

  Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     1       46.63        8.106   1409728
5     1       46.50        8.106   1554807
6     1       57.79        8.106   1439542

For each 'Store', I want to normalize/scale two columns ("Sum_sales" and "Temperature").

Desired output:

  Store Temperature Unemployment Sum_Sales
1     1       1.000        8.106   1.00000
2     1       0.000        8.106   0.94533
3     1       0.374        8.106   0.00000
4     2       0.012        8.106   0.00000
5     2       0.000        8.106   1.00000
6     2       1.000        8.106   0.20550

Here is the normalizing function that I created:

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

What I have tried:

df2 <- df %.%
  group_by('Store') %.%
  summarise(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales)))

Any suggestions/help would be greatly appreciated. Thanks.

You have an `unemployment` column in your result. Is this the same for each store> — jlhoward, Nov 15 '14 at 20:02

Vincent · Accepted Answer · 2019-03-29T21:42:52.983

21

The issue is that you are using the wrong dplyr verb. Summarize will create one result per group per variable. What you want is mutate. Mutate changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below two approaches using dplyr:

df %>%
    group_by(Store) %>%
    mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
    group_by(Store) %>%
    mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.

edited Mar 29 '19 at 21:42

answered Nov 16 '14 at 08:06

Vincent

5,063
3
28
39

I am getting an error "could not find function normalit", is there some package I need to install for this? @Vincent – Agustín Indaco Dec 17 '16 at 01:26
3

See OPs normalit function – Vincent Mar 14 '17 at 12:45
Link to vignette is dead – Alex Reynolds Mar 29 '19 at 05:16
Link fixed. Thanks @AlexReynolds – Vincent Mar 29 '19 at 21:43

score 3 · Answer 2 · answered Oct 30 '21 at 12:30

Update with dplyr 1.0.0

The solution by @Vincent can be written in the new dplyr syntax that provides across:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), normalit)

If you don't remember the definition of normalit, you can use the scales package that provides the rescale function which normalizes with default parameters:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), scales::rescale)

jlhoward · Answer 3 · 2014-11-16T02:06:33.040

2

Here's a data.table solution. I changed your example a bit to have two type of store.

df <- read.table(header=T,text="Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     2       46.63        8.106   1409728
5     2       46.50        8.106   1554807
6     2       57.79        8.106   1439542")

library(data.table)
DT <- as.data.table(df)
DT[,list(Temperature=normalit(Temperature),Sum_Sales=normalit(Sum_Sales)),
    by=list(Store,Unemployment)]
#    Store Unemployment Temperature Sum_Sales
# 1:     1        8.106  1.00000000 1.0000000
# 2:     1        8.106  0.00000000 0.9453393
# 3:     1        8.106  0.37368421 0.0000000
# 4:     2        8.106  0.01151461 0.0000000
# 5:     2        8.106  0.00000000 1.0000000
# 6:     2        8.106  1.00000000 0.2055018

Note that your normalization will have problems if there is only 1 row for a stoer.

edited Nov 16 '14 at 02:06

answered Nov 15 '14 at 23:18

jlhoward

58,004
7
97
140

Nice solution @jlhoward. I assume Unemployment is added to the by list so it doesn't get dropped. Just out of curiosity, how would you keep Unemployment if it were not constant in each store and still get the desired result? – Vincent Nov 16 '14 at 08:14
OP wants 1 row per Store. If Unemployment is not constant for a given store, what value would you use? – jlhoward Nov 16 '14 at 17:24
I just wondered if there are three rows per store (as in OP's desired output), how would you keep Unemployment if not through by=list(). – Vincent Nov 17 '14 at 06:44

scale/normalize columns by group

3 Answers3

Update with dplyr 1.0.0

Linked

Related