R Function normalize data frame variable

Question

I have a dataset with continuous variables and two categorical ID variables. I would like to scale my data such that the continuous variables have the value of one for a a certain value of each ID. I try to implement it as a function in R so that I can avoid repeated coding.

Moreover, I have an example of this rescaling implemented in Stata:

 gen value_var_i_k= value_var if ID1=="15t16" & ID2 =="AUS"
 egen value_var_i_k_m = mean(value_var_i_k)
 drop value_var_i_k
 rename value_var_i_k_m value_var_i_k

 gen value_var_k= value_var if ID1 =="15t16"
 bys ID2: egen value_var_k_m = mean(value_var_k)
 drop value_var_k
 rename value_var_k_m value_var_k

 gen value_var_i = value_var if ID2=="AUS"
 bys ID1:   egen value_var_i_m = mean(value_var_i)
 drop value_var_i
 rename value_var_i_m value_var_i

 gen value_var_i_k_norm= value_var * value_var_i_k/(value_var_i*value_var_k)

My attempt in R to create the value_var_i value_var_k and value_var_i_k variables with new variables, which exits with an error (not applicable method for class "character"):

library(dplyr)
library(magrittr)
normalize<-function(var,data,i,k) { 

varname <- paste("value", var , sep="_")
#Id columns and define variables to select
col1<-"ID1"
col2<-"ID2"
select_variables<-c(col2,col1,varname)

#name of the output variables
name_ik<-paste(i,k,sep="_")
name.ik<-paste(name_ik, "df", sep="_")

name.i<-paste(i, "df", sep="_")

name.k<-paste(k, "df", sep="_")

#my attempt to replicate the Stata code with dplyr

data %>% filter_(as.name(col1)==as.name(i) & as.name(col2)==as.name(k)) %>% select_( .dots=select_variables) %$% as.name( name.ik)

data  %>% filter_(as.name(col1)==as.name(i)) %>% select_( .dots = select_variables ) %>%  group_by_(as.name(col2)) %>%transform( interp(~mean(b, na.rm = TRUE),b=as.name(varname) ))  %$% as.name(name.i)


data %>% filter_(as.name(col2)==as.name(k)) %>% select_( .dots =    select_variables ) %>%  group_by_(as.name(col1)) %>%transform( interp(~mean(b, na.rm = TRUE),b=as.name(varname) )) %$%  as.name(name.k)

norm    <- data[eval(substitute(varname)]*as.name(name.ik)/ ( as.name(name.i) * as.name(name.k) )

}

Update II: A minimal working example with the computation steps: The variables value_var are the same in both tables. The Stata code replaces the value_var_k and value_var_i variables with the variables value_var_i_m and value_var_k_m. Then value_var is transformed.

  value_var  ID1    ID2  value_var_i_k value_var_k  value_var_k_m

1.154662    15t16   AUS 1.154662    1.154662     1.154662
1.070471    17t18   AUS 1.154662        .        1.154662
0.9643197   19      AUS 1.154662        .        1.154662
1.036398    20      AUS 1.154662        .        1.154662
1.084701    21t22   AUS 1.154662        .        1.154662

1.463215    15t16   AUT 1.154662    1.463215     1.463215
1.431824    17t18   AUT 1.154662        .        1.463215
1.276983    19      AUT 1.154662        .        1.463215
1.441925    20      AUT 1.154662        .        1.463215
1.506117    21t22   AUT 1.154662        .        1.463215

1.589491    15t16   BEL 1.154662    1.589491     1.589491
1.540076    17t18   BEL 1.154662        .        1.589491
1.188218    19      BEL 1.154662        .        1.589491
1.386074    20      BEL 1.154662        .        1.589491
1.48204   21t22     BEL 1.154662        .        1.589491


value_var   ID1    ID2  value_var_i value_var_i_m
1.154662    15t16   AUS 1.154662    1.154662
1.589491    15t16   BEL   .         1.154662
1.463215    15t16   AUT   .         1.154662

1.070471    17t18   AUS 1.070471    1.070471
1.540076    17t18   BEL   .         1.070471
1.431824    17t18   AUT   .         1.070471

0.9643197   19      AUS 0.9643197   0.9643197
1.276983    19      AUT     .       0.9643197
1.188218    19      BEL     .       0.9643197

1.036398    20      AUS 1.036398    1.036398
1.441925    20      AUT     .       1.036398
1.386074    20      BEL     .       1.036398

1.084701    21t22   AUS 1.084701    1.084701  
1.506117    21t22   AUT      .      1.084701
1.48204     21t22   BEL      .      1.084701

The normalized value var is as follows (normaliziation for ID1 level "15t16" and ID2 level "AUS" ) :

      ID1    ID2  value_var_i_k_norm

      AUS   15t16   1
      AUS   17t18   1
      AUS   19      1
      AUS   20      1
      AUS   21t22   1
      AUT   15t16   1
      AUT   17t18   1.055508
      AUT   19      1.044988
      AUT   20      1.097901
      AUT   21t22   1.09571
      BEL   15t16   1
      BEL   17t18   1.045116
      BEL   19      .8951011
      BEL   20      .9715319
      BEL   21t22   .9925373

Update: To make the normalization (or scaling) steps more clear I show here the pre after normalized data in a wide format.

First I start from the following wide data

 Row-/Colnames     15t16        17t18        19t         20t      21t22
   AUS           1.154662   1.070471    0.9643197   1.036398    1.084701
   AUT           1.463215   1.431824    1.276983    1.441925    1.506117
   BEL           1.589491   1.540076    1.188218    1.386074    1.48204

I normalize the matrix to the row with the values for AUS and the column with the values of 15t16. So, I would obtain

     Row-/Colnames  15t16      17t18         19t         20t      21t22
          AUS          1         1          1           1           1
          AUT          1    1.055508    1.044988    1.097901    1.09571
          BEL          1    1.045116    .8951011    .9715319    .9925373

I don't understand how what you're trying to do counts as [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)). But I also don't understand your process. How does the final row end up as `1.05, 17t18, AUT`? Why is the `value_var` being changed in that case? — alexwhitworth, May 20 '16 at 21:10
I normalize the value variable for a specific level of ID1 (AUS) and a specific level of ID2 (15t16). The result should be that the value_var=1 when ID1=AUS or ID2=15t16. I posted a larger example of the data, so that is clearer. I normalize all entries of the matrix subject to one column and one row. The entries of the normalizing row and column are 1 after the transformation — S.K., May 20 '16 at 22:08

score 2 · Answer 1 · edited May 21 '16 at 08:14

2

So, if I understand you correctly based on your edit.

This is your "normalization" routine:

If (ID1 == "AUS" | ID2 == "15t16") {value_var == 1}
else {value_var unchanged}

That's easily accomplished via ifelse (assuming input data.frame named df):

df$value_var <- ifelse(df$ID1 == "AUS" | df$ID2 == "15t16", 1, df$value_var)

Again, I'm loath to call this "normalization", but it does what you describe. Typically normalization, in a statistical context, means scaling--i.e. subtracting the mean and dividing by the SD. But you don't appear to be doing that here.

edited May 21 '16 at 08:14

Nick Cox

35,529
6
31
47

answered May 20 '16 at 22:09

alexwhitworth

4,839
5
32
59

"normalize" is a treacherous word: across mainstream statistics, machine learning, databases, linear algebra, etc. look for the equation or code it refers to; never trust that it means what is familiar to you from your home territory. – Nick Cox May 21 '16 at 08:44

Nick Cox · Answer 2 · 2016-05-21T09:47:31.643

This is too long to fit as a comment. But I can cut down your Stata code from 13 lines to 5, and I hope thereby make what you are doing clearer, to people using Stata or R or both. I keep your code blocks and add comments.

* no need to use -egen- to create a single mean; use -summarize- results 
summarize value_var if ID1=="15t16" & ID2=="AUS", meanonly 
gen value_var_i_k = r(mean) 

* evaluating ID1 == "15t16" yields 1 or 0; dividing by 0 yields missing values 
* which are ignored, which is what you want 
bys ID2: egen value_var_k = mean(value_var / (ID1=="15t16")) 

* same device as previous block 
bys ID1: egen value_var_i = mean(value_var / (ID2=="AUS")) 

gen value_var_i_k_norm = value_var * value_var_i_k/(value_var_i * value_var_k)

In fact, we could cut down further, but this version is a little less clear:

bys ID2: egen value_var_k = mean(value_var / (ID1=="15t16")) 
bys ID1: egen value_var_i = mean(value_var / (ID2=="AUS")) 
summarize value_var if ID1=="15t16" & ID2=="AUS", meanonly     
gen value_var_i_k_norm = value_var * r(mean)/(value_var_i * value_var_k)

For the division by zero device, see this article, section 10. An alternative way to ignore certain observations is

bys ID1: egen value_var_i = mean(cond(ID2=="AUS", value_var, .))

which is discussed in the same article, section 9.

Either way, a key detail is that the mean() function of egen ignores missing values.

S.K. · Accepted Answer · 2016-05-24T12:30:46.103

I now see that my attempt to translate the Stata code led me astray in R. The normalization variables are simply the results of filtering the original data frame with respect to the specific ID value and in the next step I inner join them with the original data frame.

normalize<-function(data,var,col1,col2,i,k) { 

varname <- paste("z.ik", var , sep="_")
var.i<-paste(var,i, sep="_")
var.k<-paste(var,k, sep="_")
var.ik<-paste(var.i,k, sep="_")
varname_i <- paste("z", var.i , sep="_")
varname_k <- paste("z", var.k , sep="_")
varname_ik <- paste("z", var.ik , sep="_")

d2<-select_(data, varname, col1, col2)
filter_crit2 = interp(~ filter_var1 %in% i & filter_var2 %in% k   ,.values = list(filter_var1 = as.name(col1),filter_var2 = as.name(col2)))
filter_crit_k = interp(~  filter_var2 %in% k ,.values =   list(filter_var2 = as.name(col2)))
filter_crit_i = interp(~ filter_var1 %in% i  ,.values =   list(filter_var1 = as.name(col1)))
select_variables.k<-c(col2,varname)
select_variables.i<-c(col1,varname)
results_ik<- data %>% filter_(    filter_crit2 )  %>% select_( .dots=varname )  %>%rename_(.dots=setNames( varname, varname_ik))
results_ik<-cbind(  results_ik ,d2)
results_i<- data %>% filter_(    filter_crit_i )  %>% select_( .dots=select_variables.k) %>%rename_(.dots=setNames( varname, varname_i)) %>%inner_join( ., results_ik)
results_k<- data %>% filter_(    filter_crit_k )  %>% select_( .dots=select_variables.i) %>%rename_(.dots=setNames( varname, varname_k)) %>% inner_join( .,   results_i)

  mutate_fn <- function(d_in,  varval,  varname_norm){
   d_out = d_in %>%
  mutate_(.dots = setNames(  varval,  varname_norm))
  }
d_in=results_k
varname_norm<- paste("z", var, sep="_")
varname_norm <- paste(varname_norm,"norm", sep="_")
varval <- lazyeval::interp(~ var1* var2 / (var3 * var4),.values=   list( var1=as.name(varname) , var2=as.name(varname_ik),    var3=as.name(varname_i), var4=as.name(varname_k)))  

data.norm = mutate_fn(d_in,varval,varname_norm )
n<-length(data.norm)
names(data.norm)[n]<-varname_norm 
data.norm <-select_(       data.norm,.dots=c(col1,col2,varname,varname_norm))
 }

R Function normalize data frame variable

3 Answers3