This is my data:

a       b       c     d         e           f           g
<dbl>   <dbl>   <dbl> <dbl>     <dbl>       <dbl>       <dbl>
14.6    74529   720   4639.341  10039.323   0.3089194   0.00011135818
270.0   74529   720   4639.341  10039.323   0.3089194   0.00011135818
14.6    74529   720   4639.341  10039.323   0.3089194   0.00011135818
390.0   74529   720   4639.341  10039.323   0.3089194   0.00011135818
2000.0  74529   720   4639.341  10039.323   0.3089194   0.00011135818
2452.0  74529   720   4639.341  10039.323   0.3089194   0.00011135818
10315.0 74529   720   4639.341  10039.323   0.3089194   0.00011135818
190.6   74529   720   4639.341  10039.323   0.3089194   0.00011135818
1050.0  74529   720   4639.341  10039.323   0.3089194   0.00011135818
14.6    74529   720   4639.341  10039.323   0.3089194   0.00011135818
...

Let's say I want to create a new variable by adding other variables together. Since the variables are not on comparable scales, I need to rescale them first. Their distributions are not normal, and the normalization should also be robust to outliers. What is the best way to normalize the data so that I can sum the variables to create a new parameter?

Leyla Alkan
  • Is your question not better suited for [stats exchange](https://stats.stackexchange.com)? Once you have a valid statistical method you can ask here for your issues around implementation in code. – Paul van Oppen Aug 10 '20 at 11:19

1 Answer

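Use `scale(x)`, which centers each column on its mean and divides it by its standard deviation. To deal with outliers, discard scaled values whose absolute value exceeds a chosen threshold; e.g., `which(abs(scale(x)) > 3)` returns the positions of values more than 3 standard deviations from the mean.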

Do this for every column, take the union of the outlier rows found across all columns, and discard those rows from the whole data set before you proceed.
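The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `df`, its values, and the names `threshold`, `is_outlier`, `cleaned`, and `new_var` are hypothetical and not taken from the question. Note that `scale()` returns `NaN` for a column whose standard deviation is zero (columns `b` through `g` are constant in the rows shown in the question), so the sketch ignores missing comparisons when flagging rows.

```r
# Hypothetical example data (not the asker's data): 50 rows, one extreme value in `a`
df <- data.frame(
  a = c(seq(90, 110, length.out = 49), 10000),
  b = seq(3, 7, length.out = 50)
)

# Standardize every column: subtract the column mean, divide by the column s.d.
z <- scale(df)

# Flag a row if ANY column is more than `threshold` s.d. from its column mean.
# na.rm = TRUE skips NaN comparisons that arise from constant columns.
threshold <- 3
is_outlier <- apply(abs(z) > threshold, 1, any, na.rm = TRUE)

# Drop the flagged rows from all columns at once
cleaned <- df[!is_outlier, , drop = FALSE]

# Re-standardize the remaining rows and sum across columns
new_var <- rowSums(scale(cleaned))
```

Indexing with the logical vector `!is_outlier` rather than `df[-which(...), ]` is a deliberate choice: negative indexing with an empty `which()` result would drop every row when no outliers are found.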

Niels Holst