I have a dataset with continuous variables and two categorical ID variables. I would like to scale my data such that the continuous variables have the value of one for a a certain value of each ID. I try to implement it as a function in R so that I can avoid repeated coding.
Moreover, I have an example of this rescaling implemented in Stata:
gen value_var_i_k= value_var if ID1=="15t16" & ID2 =="AUS"
egen value_var_i_k_m = mean(value_var_i_k)
drop value_var_i_k
rename value_var_i_k_m value_var_i_k
gen value_var_k= value_var if ID1 =="15t16"
bys ID2: egen value_var_k_m = mean(value_var_k)
drop value_var_k
rename value_var_k_m value_var_k
gen value_var_i = value_var if ID2=="AUS"
bys ID1: egen value_var_i_m = mean(value_var_i)
drop value_var_i
rename value_var_i_m value_var_i
gen value_var_i_k_norm= value_var * value_var_i_k/(value_var_i*value_var_k)
My attempt in R to create the value_var_i value_var_k
and value_var_i_k
variables with new variables, which exits with an error (not applicable method for class "character"):
library(dplyr)
library(magrittr)
normalize<-function(var,data,i,k) {
varname <- paste("value", var , sep="_")
#Id columns and define variables to select
col1<-"ID1"
col2<-"ID2"
select_variables<-c(col2,col1,varname)
#name of the output variables
name_ik<-paste(i,k,sep="_")
name.ik<-paste(name_ik, "df", sep="_")
name.i<-paste(i, "df", sep="_")
name.k<-paste(k, "df", sep="_")
#my attempt to replicate the Stata code with dplyr
data %>% filter_(as.name(col1)==as.name(i) & as.name(col2)==as.name(k)) %>% select_( .dots=select_variables) %$% as.name( name.ik)
data %>% filter_(as.name(col1)==as.name(i)) %>% select_( .dots = select_variables ) %>% group_by_(as.name(col2)) %>%transform( interp(~mean(b, na.rm = TRUE),b=as.name(varname) )) %$% as.name(name.i)
data %>% filter_(as.name(col2)==as.name(k)) %>% select_( .dots = select_variables ) %>% group_by_(as.name(col1)) %>%transform( interp(~mean(b, na.rm = TRUE),b=as.name(varname) )) %$% as.name(name.k)
norm <- data[eval(substitute(varname)]*as.name(name.ik)/ ( as.name(name.i) * as.name(name.k) )
}
Update II: A minimal working example with the computation steps:
The variables value_var
are the same in both tables. The Stata code replaces the value_var_k
and value_var_i
variables with the variables value_var_i_m
and value_var_k_m
.
Then value_var
is transformed.
value_var ID1 ID2 value_var_i_k value_var_k value_var_k_m
1.154662 15t16 AUS 1.154662 1.154662 1.154662
1.070471 17t18 AUS 1.154662 . 1.154662
0.9643197 19 AUS 1.154662 . 1.154662
1.036398 20 AUS 1.154662 . 1.154662
1.084701 21t22 AUS 1.154662 . 1.154662
1.463215 15t16 AUT 1.154662 1.463215 1.463215
1.431824 17t18 AUT 1.154662 . 1.463215
1.276983 19 AUT 1.154662 . 1.463215
1.441925 20 AUT 1.154662 . 1.463215
1.506117 21t22 AUT 1.154662 . 1.463215
1.589491 15t16 BEL 1.154662 1.589491 1.589491
1.540076 17t18 BEL 1.154662 . 1.589491
1.188218 19 BEL 1.154662 . 1.589491
1.386074 20 BEL 1.154662 . 1.589491
1.48204 21t22 BEL 1.154662 . 1.589491
value_var ID1 ID2 value_var_i value_var_i_m
1.154662 15t16 AUS 1.154662 1.154662
1.589491 15t16 BEL . 1.154662
1.463215 15t16 AUT . 1.154662
1.070471 17t18 AUS 1.070471 1.070471
1.540076 17t18 BEL . 1.070471
1.431824 17t18 AUT . 1.070471
0.9643197 19 AUS 0.9643197 0.9643197
1.276983 19 AUT . 0.9643197
1.188218 19 BEL . 0.9643197
1.036398 20 AUS 1.036398 1.036398
1.441925 20 AUT . 1.036398
1.386074 20 BEL . 1.036398
1.084701 21t22 AUS 1.084701 1.084701
1.506117 21t22 AUT . 1.084701
1.48204 21t22 BEL . 1.084701
The normalized value var is as follows (normaliziation for ID1 level "15t16" and ID2 level "AUS" ) :
ID1 ID2 value_var_i_k_norm
AUS 15t16 1
AUS 17t18 1
AUS 19 1
AUS 20 1
AUS 21t22 1
AUT 15t16 1
AUT 17t18 1.055508
AUT 19 1.044988
AUT 20 1.097901
AUT 21t22 1.09571
BEL 15t16 1
BEL 17t18 1.045116
BEL 19 .8951011
BEL 20 .9715319
BEL 21t22 .9925373
Update: To make the normalization (or scaling) steps more clear I show here the pre after normalized data in a wide format.
First I start from the following wide data
Row-/Colnames 15t16 17t18 19t 20t 21t22
AUS 1.154662 1.070471 0.9643197 1.036398 1.084701
AUT 1.463215 1.431824 1.276983 1.441925 1.506117
BEL 1.589491 1.540076 1.188218 1.386074 1.48204
I normalize the matrix to the row with the values for AUS and the column with the values of 15t16. So, I would obtain
Row-/Colnames 15t16 17t18 19t 20t 21t22
AUS 1 1 1 1 1
AUT 1 1.055508 1.044988 1.097901 1.09571
BEL 1 1.045116 .8951011 .9715319 .9925373