16

I have a data frame that looks like this:

  Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     1       46.63        8.106   1409728
5     1       46.50        8.106   1554807
6     1       57.79        8.106   1439542

For each 'Store', I want to normalize/scale two columns ("Sum_sales" and "Temperature").

Desired output:

  Store Temperature Unemployment Sum_Sales
1     1       1.000        8.106   1.00000
2     1       0.000        8.106   0.94533
3     1       0.374        8.106   0.00000
4     2       0.012        8.106   0.00000
5     2       0.000        8.106   1.00000
6     2       1.000        8.106   0.20550

Here is the normalizing function that I created:

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

What I have tried:

df2 <- df %.%
  group_by('Store') %.%
  summarise(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales)))

Any suggestions/help would be greatly appreciated. Thanks.

Henrik
  • 65,555
  • 14
  • 143
  • 159
itjcms18
  • 3,993
  • 7
  • 26
  • 45

3 Answers3

21

The issue is that you are using the wrong dplyr verb. Summarize will create one result per group per variable. What you want is mutate. Mutate changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below two approaches using dplyr:

df %>%
    group_by(Store) %>%
    mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
    group_by(Store) %>%
    mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.

Vincent
  • 5,063
  • 3
  • 28
  • 39
3

Update with dplyr 1.0.0

The solution by @Vincent can be written in the new dplyr syntax that provides across:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), normalit)

If you don't remember the definition of normalit, you can use the scales package that provides the rescale function which normalizes with default parameters:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), scales::rescale)
Agile Bean
  • 6,437
  • 1
  • 45
  • 53
2

Here's a data.table solution. I changed your example a bit to have two type of store.

df <- read.table(header=T,text="Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     2       46.63        8.106   1409728
5     2       46.50        8.106   1554807
6     2       57.79        8.106   1439542")

library(data.table)
DT <- as.data.table(df)
DT[,list(Temperature=normalit(Temperature),Sum_Sales=normalit(Sum_Sales)),
    by=list(Store,Unemployment)]
#    Store Unemployment Temperature Sum_Sales
# 1:     1        8.106  1.00000000 1.0000000
# 2:     1        8.106  0.00000000 0.9453393
# 3:     1        8.106  0.37368421 0.0000000
# 4:     2        8.106  0.01151461 0.0000000
# 5:     2        8.106  0.00000000 1.0000000
# 6:     2        8.106  1.00000000 0.2055018

Note that your normalization will have problems if there is only 1 row for a stoer.

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Nice solution @jlhoward. I assume Unemployment is added to the by list so it doesn't get dropped. Just out of curiosity, how would you keep Unemployment if it were not constant in each store and still get the desired result? – Vincent Nov 16 '14 at 08:14
  • OP wants 1 row per Store. If Unemployment is not constant for a given store, what value would you use? – jlhoward Nov 16 '14 at 17:24
  • I just wondered if there are three rows per store (as in OP's desired output), how would you keep Unemployment if not through by=list(). – Vincent Nov 17 '14 at 06:44