I am trying to create interaction variables for all 20 variables in a dataframe, so I would have in total 20 base variables and 380 interaction variables. For any single variable, I am able to create a dataframe of 19 variables by using:

    in_sample[3:22] %>%
        transmute(across(.cols = -c(frpm_frac_s), .fns = function(x){x*frpm_frac_s}))

But I am unable to iterate this across all the columns. I tried to use map over a vector of column names, but I cannot get the function inside map to read as.symbol(character) as a column reference. Here is a sample of my data from dput:

structure(list(frpm_frac_s = c(0.870400011539459, 0.904699981212616, 
0.98089998960495, 0.838800013065338, 0.919900000095367, 0.837700009346008, 
0.84799998998642, 0.925999999046326, 0.963900029659271, 0.887899994850159
), enrollment_s = c(364, 608, 571, 705, 566, 838, 421, 757, 693, 
535), ell_frac_s = c(0.46000000834465, 0.334000021219254, 0.300999999046326, 
0.209999993443489, 0.706999957561493, 0.552999973297119, 0.412999987602234, 
0.359000027179718, 0.726000010967255, 0.646999955177307), edi_s = c(8, 
38, 39, 37, 11, 35, 15, 39, 9, 4), te_fte_s = c(23, 22, 20, 25, 
24.5, 36, 18, 30.2999992370605, 24.3999996185303, 19)), row.names = c(NA, 
10L), class = "data.frame")

When using:

 in_sample[3:22] %>%
    transmute(across(.cols = -c(frpm_frac_s), .fns = function(x){x*frpm_frac_s}))

I get:

structure(list(enrollment_s = c(316.825604200363, 550.057588577271, 
560.093894064426, 591.354009211063, 520.663400053978, 701.992607831955, 
357.007995784283, 700.981999278069, 667.982720553875, 475.026497244835
), ell_frac_s = c(0.400384012571335, 0.302169812922072, 0.295250895935631, 
0.17614799724412, 0.650369261028242, 0.463248082799339, 0.350223985351086, 
0.33243402482605, 0.699791432103968, 0.574471256869984), edi_s = c(6.96320009231567, 
34.3785992860794, 38.255099594593, 31.0356004834175, 10.118900001049, 
29.3195003271103, 12.7199998497963, 36.1139999628067, 8.67510026693344, 
3.55159997940063), te_fte_s = c(20.0192002654076, 19.9033995866776, 
19.617999792099, 20.9700003266335, 22.5375500023365, 30.1572003364563, 
15.2639998197556, 28.0577992646217, 23.5191603559875, 16.870099902153
)), row.names = c(NA, 10L), class = "data.frame")

I would like to do this for all variables and then cbind them together. Thank you for your help.

Memiya
  • The sample data you provided has 10 rows and 5 columns. Which are the 20 variables you're referring to? It would also help if you could provide a short example of your desired output. – Desmond Feb 24 '22 at 02:40
  • @Desmond Hello, this is a subset of my data that is smaller in both rows and columns compared to the original data. If you need the full data, I can also provide it. I will add the desired output of the one variable I managed to make it work for. – Memiya Feb 24 '22 at 02:42
  • @Memiya Are you trying to multiply all columns by first column (frpm_frac_s)? If so, you could try: `cols <- names(df)` `df[paste0(cols, "_new")] <- df[cols] * df$frpm_frac_s`. This answer is adapted from https://stackoverflow.com/questions/51841572/create-new-variable-for-many-columns-in-r – Mel G Feb 24 '22 at 02:45
  • Hello @MelG, I am trying to do this for all columns. So I have 2-way interactions between all columns. Would I be able to iterate this solution over all columns? – Memiya Feb 24 '22 at 02:49
  • What's the aim of creating the interaction variables? Do you need them for modelling? – Onyambu Feb 24 '22 at 02:52
  • @Onyambu Yes, I need them for modelling. – Memiya Feb 24 '22 at 02:55
  • In that case, use the formula. All R modelling functions build the model matrix internally, e.g. `lm(y~.^2, df)` will have all the second-order interactions. So there is no need to create the matrix yourself. – Onyambu Feb 24 '22 at 03:00
  • @Memiya this might be [helpful reading for you](https://recipes.tidymodels.org/reference/step_interact.html). Tidymodels is an ML framework which `recipes` is a part of, and helps with preprocessing steps, e.g. creating interaction variables. – Desmond Feb 24 '22 at 03:00

1 Answer

You can use model.matrix to create interaction terms. (This is what's done under the hood in most modeling functions.)

m = model.matrix(~ .^2 - . + 0, data = df)
m
#    frpm_frac_s:enrollment_s frpm_frac_s:ell_frac_s frpm_frac_s:edi_s frpm_frac_s:te_fte_s
# 1                  316.8256              0.4003840            6.9632             20.01920
# 2                  550.0576              0.3021698           34.3786             19.90340
# 3                  560.0939              0.2952509           38.2551             19.61800
# 4                  591.3540              0.1761480           31.0356             20.97000
# 5                  520.6634              0.6503693           10.1189             22.53755
# 6                  701.9926              0.4632481           29.3195             30.15720
# 7                  357.0080              0.3502240           12.7200             15.26400
# 8                  700.9820              0.3324340           36.1140             28.05780
# 9                  667.9827              0.6997914            8.6751             23.51916
# 10                 475.0265              0.5744713            3.5516             16.87010
#    enrollment_s:ell_frac_s enrollment_s:edi_s enrollment_s:te_fte_s ell_frac_s:edi_s
# 1                  167.440               2912                8372.0            3.680
# 2                  203.072              23104               13376.0           12.692
# 3                  171.871              22269               11420.0           11.739
# 4                  148.050              26085               17625.0            7.770
# 5                  400.162               6226               13867.0            7.777
# 6                  463.414              29330               30168.0           19.355
# 7                  173.873               6315                7578.0            6.195
# 8                  271.763              29523               22937.1           14.001
# 9                  503.118               6237               16909.2            6.534
# 10                 346.145               2140               10165.0            2.588
#    ell_frac_s:te_fte_s edi_s:te_fte_s
# 1              10.5800          184.0
# 2               7.3480          836.0
# 3               6.0200          780.0
# 4               5.2500          925.0
# 5              17.3215          269.5
# 6              19.9080         1260.0
# 7               7.4340          270.0
# 8              10.8777         1181.7
# 9              17.7144          219.6
# 10             12.2930           76.0
# attr(,"assign")
#  [1]  1  2  3  4  5  6  7  8  9 10

Your math is a little off. Because order doesn't matter in multiplication, there are n * (n - 1) / 2 possible pairs (the same as n choose 2), so you should expect 190 output columns for 20 input columns.
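
The count is easy to check directly, and the same unordered pairs can be built by hand with combn if you want to see the iteration spelled out. This is only a minimal sketch on a made-up data frame; `df` and its column names `a`, `b`, `c` are hypothetical and stand in for your real data.

```r
# Number of unordered pairs among 20 columns
choose(20, 2)        # 190
20 * (20 - 1) / 2    # same value

# Hypothetical toy data; the column names are for illustration only
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# List of all unordered pairs of column names
pairs <- combn(names(df), 2, simplify = FALSE)

# Multiply the two columns of each pair and bind the products into a matrix
inter <- do.call(cbind, lapply(pairs, function(p) df[[p[1]]] * df[[p[2]]]))

# Name each product column "x:y", matching the model.matrix convention
colnames(inter) <- vapply(pairs, paste, character(1), collapse = ":")

ncol(inter)  # choose(3, 2) = 3
```

Because combn only generates each pair once, a:b appears but b:a does not, which is why the count is 190 rather than 380.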

I wrote the formula to include only the interaction terms. You can use ~ .^2 + 0 to include the first-order terms too, or ~ .^2 to also include an intercept.
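
To compare the three formula variants and then bind the interaction columns back onto the original data (the cbind step mentioned in the question), something like the following should work. It is a sketch on a hypothetical toy data frame; `df` and the column names `a`, `b`, `c` are made up for illustration.

```r
# Hypothetical toy data; the column names are for illustration only
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

m_int  <- model.matrix(~ .^2 - . + 0, data = df)  # interactions only
m_main <- model.matrix(~ .^2 + 0, data = df)      # first-order terms + interactions
m_full <- model.matrix(~ .^2, data = df)          # intercept + first-order + interactions

ncol(m_int)   # 3 interaction columns
ncol(m_main)  # 3 first-order + 3 interaction columns
ncol(m_full)  # one extra column for the intercept

# Bind the interaction columns onto the original data frame.
# check.names = FALSE keeps the "a:b" style names instead of turning them into "a.b".
out <- cbind(df, m_int, check.names = FALSE)
names(out)
```

Names such as a:b are not syntactic in R, so they need backticks or `[[` when accessed, for example out[["a:b"]].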

Gregor Thomas