8

I would like to standardize variables in R. I know about multiple approaches to doing this. However, I really like using the approach below:

library(tidyverse)

df <- mtcars

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = key, values_from = value)

For some reason this approach does not work: I am not able to return the data to the original (wide) format. Therefore, I would like to ask for advice.

Ric S
Petr

4 Answers

8

According to the current documentation, you should use the across()-based syntax to perform operations on a desired subset of columns. You can use everything() to select all columns, or any other available selection helper. Use the group_by verb only if you want to perform an operation on groups of rows; group_by is not the right choice for selecting variables.

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric), ~ . - mean(.)))
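
Since every column of mtcars is numeric, the same centring can also be written with the everything() selector mentioned above (a small variation of my own, not from the original answer):

mtcars %>%
    as_tibble() %>%
    mutate(across(everything(), ~ .x - mean(.x)))   # centre every column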

As for the actual standardisation, or any other operation you want to apply to the subset of columns, the documentation of across() describes the .fns argument as follows (a small example of the list form is sketched after this list):

.fns Functions to apply to each of the selected columns. Possible values are:

  • NULL, to return the columns untransformed.
  • A function, e.g. mean.
  • A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE).
  • A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))).
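
For instance, a minimal sketch of the list form (the suffix names centred and z are my own choice, not from the documentation): each selected column gets both a centred and a z-scored copy, with the list names appended to the new column names.

mtcars %>%
    as_tibble() %>%
    mutate(across(
        where(is.numeric),
        list(
            centred = ~ .x - mean(.x),          # same centring as above
            z       = ~ as.numeric(scale(.x))   # classic z-score
        )
    ))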

So for scale you can do:

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric), scale))

or, passing additional arguments inside a lambda (passing them through the ... of across() is now deprecated):

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric), ~ scale(.x, center = FALSE)))

Side notes

As you can see from the ?scale documentation, the function returns a matrix. In the examples above, each transformed column is therefore a one-column matrix. If this bothers you, you can do:

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric),  ~ scale(.)[,1]))

Comparison

>> mtcars %>%
...     as_tibble() %>%
...     mutate(across(where(is.numeric),  ~ scale(.)[,1])) %>% 
...     glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl> 0.15088482, 0.15088482, 0.44954345, 0.21725341, -0.23073453, -0.33028740, -0.96078…
$ cyl  <dbl> -0.1049878, -0.1049878, -1.2248578, -0.1049878, 1.0148821, -0.1049878, 1.0148821, …
$ disp <dbl> -0.57061982, -0.57061982, -0.99018209, 0.22009369, 1.04308123, -0.04616698, 1.0430…
$ hp   <dbl> -0.53509284, -0.53509284, -0.78304046, -0.53509284, 0.41294217, 
...
>> 
>> 
>> mtcars %>%
...     as_tibble() %>%
...     mutate(across(where(is.numeric), scale)) %>% 
...     glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl[,1]> <matrix[32 x 1]>
$ cyl  <dbl[,1]> <matrix[32 x 1]>
$ disp <dbl[,1]> <matrix[32 x 1]>
$ hp   <dbl[,1]> <matrix[32 x 1]>
...
Konrad
1

It is not clear why you first reshape the data into long format and then back to wide, nor why you don't simply prefer scale(df), which is computationally much faster (a minimal sketch of that route follows below).
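
For reference, a minimal sketch of the scale(df) route mentioned above (my own illustration, not part of the original post): scale() returns a matrix, so it is converted back to a data frame afterwards. Note that this only works because mtcars is entirely numeric.

df_scaled <- as.data.frame(scale(df))   # centre and divide by the standard deviation
head(df_scaled)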

Anyway, if you really want to use code similar to the one you like, you need to perform a further unnest step in order to return the data to the original format.

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = key, values_from = value) %>% 
  unnest(everything())

# A tibble: 32 x 11
#       mpg    cyl    disp    hp    drat       wt   qsec     vs     am   gear   carb
#     <dbl>  <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1  0.909 -0.188  -70.7  -36.7  0.303  -0.597   -1.39  -0.438  0.594  0.312  1.19 
#  2  0.909 -0.188  -70.7  -36.7  0.303  -0.342   -0.829 -0.438  0.594  0.312  1.19 
#  3  2.71  -2.19  -123.   -53.7  0.253  -0.897    0.761  0.562  0.594  0.312 -1.81 
#  4  1.31  -0.188   27.3  -36.7 -0.517  -0.00225  1.59   0.562 -0.406 -0.688 -1.81 
#  5 -1.39   1.81   129.    28.3 -0.447   0.223   -0.829 -0.438 -0.406 -0.688 -0.812
#  6 -1.99  -0.188   -5.72 -41.7 -0.837   0.243    2.37   0.562 -0.406 -0.688 -1.81 
#  7 -5.79   1.81   129.    98.3 -0.387   0.353   -2.01  -0.438 -0.406 -0.688  1.19 
#  8  4.31  -2.19   -84.0  -84.7  0.0934 -0.0272   2.15   0.562 -0.406  0.312 -0.812
#  9  2.71  -2.19   -89.9  -51.7  0.323  -0.0673   5.05   0.562 -0.406  0.312 -0.812
# 10 -0.891 -0.188  -63.1  -23.7  0.323   0.223    0.451  0.562 -0.406  0.312  1.19 
# ... with 22 more rows

EDIT

In order to use the most up-to-date tidyr functions, you should consider replacing gather, which is now retired, with pivot_longer, as in the following code. The result is the same.

df %>% 
  pivot_longer(everything()) %>% 
  group_by(name) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = name, values_from = value) %>% 
  unnest(everything())
Ric S
  • This is a complex and computationally expensive solution to a trivial problem. – Konrad Jul 03 '20 at 13:02
  • 2
    I know that it is computationally intensive, in fact I wrote "why you don't prefer scale(df) which is computationally much faster". I was following the OP's indications as he/she said "I know about multiple approaches how this can be done. However, I really like using this approach below" – Ric S Jul 03 '20 at 13:05
  • As I wrote in the original question, I know that there are many more sophisticated ways to do this. The main idea of this approach is that everyone, e.g. students, can *see* the formula where we subtract the mean and divide by the standard deviation – Petr Jul 09 '20 at 09:43
  • @Petr so is my approach what you were looking for? Did it solve your issue? – Ric S Jul 09 '20 at 09:48
  • scale(df) is wrong. scale() expects a matrix. It will sort of work if your df happens to only contain numerical data, but this is rare, and the returned object is no longer a data frame. Trivial counterexample: scale(iris). – nth Sep 24 '21 at 10:17
0

The warning message is helpful:

Values are not uniquely identified; output will contain list-cols.

You need a column that uniquely identifies each row:

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(row = row_number(), value = value - mean(value)) %>% 
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)

You'll probably want to add back the rownames to the result; one possible way to keep them is sketched below.
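
For example, a minimal sketch (my own suggestion, not part of the original answer) that keeps the car names by turning the rownames into a regular column before reshaping, and then using that column as the row identifier:

df %>% 
  rownames_to_column("car") %>%   # keep the car names as a column ("car" is an arbitrary name)
  pivot_longer(-car) %>%          # reshape everything except "car"
  group_by(name) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = name, values_from = value)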

Lukasz
0
df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(row = row_number(), value = value - mean(value)) %>% 
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)
taylor.2317