0

I'm performing multilevel analysis and, to check for statistical artifacts, I want to create new group level variables adding random noise to some group level variables (like school-level socioeconomic composition).

In R, the jitter() function allows to add random noise to a column easily:

jittered_variable <- jitter(variable)

However, this function add different values to a group level variable within groups.

Is there any way in R to add the same random noise to a group level variable within groups? For example, +0.5 for all the members of group 1, -0.3 for all the members of group 2, and so on.

AlejandroDGR
  • 178
  • 1
  • 10

2 Answers2

1

Here is a way.
Create a named vector with as many random values as there are groups. The names are the values that define the groups. Then add the random values matching the data with those names.

df1 <- iris[4:5]

set.seed(2023)
ngroups <- length(unique(df1$Species))
noise <- jitter(rep(0, ngroups))
noise <- setNames(noise, unique(df1$Species))
noise
#>       setosa   versicolor    virginica 
#> -0.001335442 -0.006592362 -0.013487298

df1$newcol <- df1$Petal.Width + noise[match(df1$Species, names(noise))]

# these differences are the random noise values
df1$newcol - df1$Petal.Width
#>   [1] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>   [6] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [11] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [16] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [21] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [26] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [31] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [36] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [41] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [46] -0.001335442 -0.001335442 -0.001335442 -0.001335442 -0.001335442
#>  [51] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [56] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [61] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [66] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [71] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [76] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [81] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [86] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [91] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#>  [96] -0.006592362 -0.006592362 -0.006592362 -0.006592362 -0.006592362
#> [101] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [106] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [111] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [116] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [121] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [126] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [131] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [136] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [141] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298
#> [146] -0.013487298 -0.013487298 -0.013487298 -0.013487298 -0.013487298

Created on 2023-04-02 with reprex v2.0.2

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
0

I'm going to assume a runif jittering, though this method can work with other functions/parameters as well (with modification).

Create a frame with the grouping variable(s) and the parameters for jittering, such as

jitters <- data.frame(
  cyl = c(4L, 6L, 8L),
  neg = c(0, -0.3, -0.1),
  pos = c(0.5, 0, 0.1)
)
jitters
#   cyl  neg pos
# 1   4  0.0 0.5
# 2   6 -0.3 0.0
# 3   8 -0.1 0.1

From here, we can runif across all variables at once with a simple:

set.seed(42)
out <- merge(mtcars, jitters, by = "cyl", all.x = TRUE) |>
  transform(mpg2 = mpg + runif(nrow(mtcars), neg, pos))
out
#    cyl  mpg  disp  hp drat    wt  qsec vs am gear carb  neg pos     mpg2
# 1    4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2  0.0 0.5 23.25740
# 2    4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1  0.0 0.5 23.26854
# 3    4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2  0.0 0.5 24.54307
# 4    4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1  0.0 0.5 21.91522
# 5    4 30.4  75.7  52 4.93 1.615 18.52  1  1    4    2  0.0 0.5 30.72087
# 6    4 33.9  71.1  65 4.22 1.835 19.90  1  1    4    1  0.0 0.5 34.15955
# 7    4 26.0 120.3  91 4.43 2.140 16.70  0  1    5    2  0.0 0.5 26.36829
# 8    4 30.4  95.1 113 3.77 1.513 16.90  1  1    5    2  0.0 0.5 30.46733
# 9    4 32.4  78.7  66 4.08 2.200 19.47  1  1    4    1  0.0 0.5 32.72850
# 10   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2  0.0 0.5 21.75253
# 11   4 27.3  79.0  66 4.08 1.935 18.90  1  1    4    1  0.0 0.5 27.52887
# 12   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4 -0.3 0.0 20.91573
# 13   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4 -0.3 0.0 20.98040
# 14   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4 -0.3 0.0 17.57663
# 15   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1 -0.3 0.0 21.23869
# 16   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1 -0.3 0.0 18.08200
# 17   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4 -0.3 0.0 19.19347
# 18   6 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6 -0.3 0.0 19.43525
# 19   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2 -0.1 0.1 18.69500
# 20   8 17.3 275.8 180 3.07 3.730 17.60  0  0    3    3 -0.1 0.1 17.31207
# 21   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4 -0.1 0.1 14.38081
# 22   8 14.7 440.0 230 3.23 5.345 17.42  0  0    3    4 -0.1 0.1 14.62774
# 23   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4 -0.1 0.1 10.49778
# 24   8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3 -0.1 0.1 16.48933
# 25   8 19.2 400.0 175 3.08 3.845 17.05  0  0    3    2 -0.1 0.1 19.11649
# 26   8 15.2 275.8 180 3.07 3.780 18.00  0  0    3    3 -0.1 0.1 15.20284
# 27   8 15.2 304.0 150 3.15 3.435 17.30  0  0    3    2 -0.1 0.1 15.17804
# 28   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4 -0.1 0.1 10.48115
# 29   8 15.8 351.0 264 4.22 3.170 14.50  0  1    5    4 -0.1 0.1 15.78939
# 30   8 15.5 318.0 150 2.76 3.520 16.87  0  0    3    2 -0.1 0.1 15.56720
# 31   8 15.0 301.0 335 3.54 3.570 14.60  0  1    5    8 -0.1 0.1 15.04752
# 32   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4 -0.1 0.1 13.36221

If you need a different function and that function doesn't vectorize the parameters like this, you may need to Vectorize(func)(nrow(mtcars), neg, pos) or similar, though that will slow things down (with large data).

We can "verify" (within reason) the jittering with:

aggregate(cbind(mpg, mpg2) ~ cyl, data = out,
          FUN = function(z) c(min = min(z), max = max(z)))
#   cyl mpg.min mpg.max mpg2.min mpg2.max
# 1   4    21.4    33.9 21.61789 34.00383
# 2   6    17.8    21.4 17.78727 21.36633
# 3   8    10.4    19.2 10.30779 19.23546

where with cyl == 4, due to the range of -0, +0.5 we expect mpg.min <= mpg2.min; with cyl == 6 and a range of -0.3, +0 we expect mpg.max >= mpg2.max; etc.

r2evans
  • 141,215
  • 6
  • 77
  • 149