
Here is a look at my dataset. I'm looking at baseball data.

structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

I'm trying to create a multiple linear regression and decide which predictor variables to include. The problem is that I suspect some of these variables are strongly correlated with each other. For example, one of the columns is "base hits by batters (any kind of hit)" and another column is "doubles by batters," and so on. So if a player hits a double, I think it would add +1 to multiple different columns.

I'm trying to figure out which variables to include and one strategy I have in mind is deciding which of these variables are correlated with each other and how strongly they are correlated. Maybe variables that are really strongly correlated with each other I won't include. (Help on this?)
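A sketch of how that "find the strongly correlated pairs" step could look (assuming you drop the non-numeric INDEX column first, and with an arbitrary cutoff you would tune yourself):

```r
# Sketch: list variable pairs whose absolute Pearson correlation exceeds a cutoff.
# Uses pairwise-complete observations so columns with some NAs still get values.
cm <- cor(moneyball_training_data[-1], use = "pairwise.complete.obs")

cutoff <- 0.8  # arbitrary threshold -- tune to taste
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  v1  = rownames(cm)[idx[, 1]],
  v2  = colnames(cm)[idx[, 2]],
  cor = cm[idx]
)
high_pairs  # one row per strongly correlated pair, no duplicates
```

`upper.tri()` keeps only one copy of each pair and skips the diagonal, so each pair shows up exactly once.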

I started down this road, looking at Pearson correlations one by one:

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_2B)

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_3B)

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_HR)

But then I saw how many combinations there are between all of these variables! There are 16 columns in this dataframe and I want to select any two: 16! / (2! × (16 − 2)!) = 120. If my math is right, this method would take 120 lines of code, and it would be easy to get tangled and lose track of which pairs I've already done... so not very efficient.
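That count can be confirmed directly with R's built-in binomial coefficient function:

```r
# Number of unordered pairs among 16 columns: 16! / (2! * (16 - 2)!)
choose(16, 2)
# [1] 120
```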

So my original question was: Is there any efficient coding method to compare the comprehensive set of correlations between variables in a dataframe?

I then found this amazing post on Stack Overflow that I think answers my question, but I still can't quite get it to work.

Side note - I also tried to figure out which columns had NA values in case NA values here made a difference.

any(is.na(moneyball_training_data$TARGET_WINS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_2B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_3B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_BASERUN_SB)) # TRUE
any(is.na(moneyball_training_data$TEAM_BASERUN_CS)) # TRUE
# (note: these columns are TEAM_BASERUN_*, not TEAM_BATTING_*; a misspelled
# column name returns NULL, and any(is.na(NULL)) silently reports FALSE)
any(is.na(moneyball_training_data$TEAM_BATTING_HBP)) # TRUE
any(is.na(moneyball_training_data$TEAM_PITCHING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_SO))# TRUE
any(is.na(moneyball_training_data$TEAM_FIELDING_E)) # FALSE
any(is.na(moneyball_training_data$TEAM_FIELDING_DP)) # TRUE

(side note - is there a more efficient way to do this `any(is.na())` check?)
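Answering my own side note in part: base R can do the whole NA scan in one line, with no typo-prone repetition:

```r
# Which columns contain any NA? anyNA() is a fast shortcut for any(is.na(...)).
sapply(moneyball_training_data, anyNA)                     # named logical vector

# Or just the names of the offending columns:
names(which(colSums(is.na(moneyball_training_data)) > 0))
```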

To continue, I now follow the direction of the other Stack Overflow answer, the tidy method, which I don't fully understand, but the person who gave the answer seemed to know what they were doing:

library(dplyr)
library(tidyr)

# helper used below to filter out rows (vectorized so it works inside filter())
f = function(x, y) grepl(x, y)
f = Vectorize(f)

moneyball_training_data %>% 
  select(-INDEX) %>%             # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2, cor, -v1) %>%       # reshape from wide to long
  filter(f(v1, v2) & v1 != v2)   # keep rows where v1's name matches inside v2's


But how can the result just be a 3 x 3 dataframe? I expected something like my drawing below, where each number would be a correlation of an x and y with voided spaces for redundancies removed.

   1     2   3    4    5    6     7
1       12   13  14   15   16    17
2            23  24   25   26    27
3                34   35   36    37
4                     45   46    47
5                          56    57
6                                67
7
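A matrix like the drawing above can be produced from a plain correlation matrix by blanking the diagonal and the redundant lower triangle (a sketch; `mtcars` stands in for the real data here):

```r
# Upper-triangle-only view of a correlation matrix, like the hand drawing:
cm <- round(cor(mtcars[, 1:5]), 2)
cm[lower.tri(cm, diag = TRUE)] <- NA  # void the diagonal and the duplicate half
cm
```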
hachiko
    `cor(df[-1])` ? – Ronak Shah Mar 06 '21 at 05:50
  • "> I'm trying to figure out which variables to include and one strategy I have in mind is deciding which of these variables are correlated with each other and how strongly they are correlated. Maybe variables that are really strongly correlated with each other I won't include. (Help on this?)" - Are you trying to find the most important predictor or combination of predictors (either independent or correlated predictors)? Or make the best prediction for your outcome of interest? – jsv Mar 06 '21 at 05:55
  • @jvargh7 I think I am worried about collinearity between my predictor variables which might give me a bad interpretation in the r-squared of my linear model. I thought if I found two variables with a high correlation coefficient, then I would just pick one of them and not both? – hachiko Mar 06 '21 at 06:07
  • @RonakShah amazing I wish I thought of that, I only ever saw cor used on variables/vectors, not on a whole dataframe – hachiko Mar 06 '21 at 06:09
  • 1
    If it's a prediction problem, collinearity shouldn't be an issue. It becomes an issue when you are trying to interpret your coefficients and their standard errors. The first alternate option is to use a regularized regression (lasso/ridge/elastic net) if you don't want to manually drop correlated predictors. The second alternate option is to use some sort of dimensionality reduction (Principal Component Analysis or the like), and use the components as your predictors. – jsv Mar 06 '21 at 06:09
  • 1
    You could use car::vif() with your linear model as an input to assess multicollinearity from your fitted linear regression model. – jsv Mar 06 '21 at 06:10
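The `car::vif()` suggestion from the comments looks roughly like this (a sketch on `mtcars`, since no model is fitted in the question; assumes the car package is installed, and VIF values above roughly 5–10 are the usual red flags for multicollinearity):

```r
# install.packages("car")
library(car)

# Fit a model, then ask for one variance inflation factor per predictor.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif(fit)
```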

2 Answers


Do you expect this kind of matrix?

df <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

# install.packages("corrr")
library(corrr)
df1 <- corrr::correlate(df, method = "pearson")

# 1. Output:
# A tibble: 17 x 18
   term    INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
   <chr>   <dbl>       <dbl>          <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
 1 INDEX NA          0.642           -0.820         -0.291           0.0236          0.0826           0.205
 2 TARG~  0.642     NA               -0.685          0.367          -0.373           0.673            0.788
 3 TEAM~ -0.820     -0.685           NA              0.192           0.496          -0.449           -0.502
 4 TEAM~ -0.291      0.367            0.192         NA              -0.0789          0.640            0.653
 5 TEAM~  0.0236    -0.373            0.496         -0.0789         NA              -0.752           -0.676
 6 TEAM~  0.0826     0.673           -0.449          0.640          -0.752          NA                0.984
 7 TEAM~  0.205      0.788           -0.502          0.653          -0.676           0.984           NA    
 8 TEAM~  0.134      0.401           -0.560          0.377          -0.754           0.864            0.799
 9 TEAM~  0.790     -0.00267         -0.690         -0.356           0.413          -0.528           -0.541
10 TEAM~  0.874     -0.0332          -0.834         -0.598           0.261          -0.578           -0.623
11 TEAM~ NA         NA               NA             NA              NA              NA               NA    
12 TEAM~ -0.662     -0.923            0.733         -0.358           0.448          -0.771           -0.852
13 TEAM~ -0.352      0.308           -0.127          0.661          -0.767           0.891            0.809
14 TEAM~ -0.914     -0.793            0.736          0.0225          0.0863         -0.341           -0.464
15 TEAM~ -0.667     -0.930            0.719         -0.360           0.424          -0.757           -0.842
16 TEAM~ -0.707     -0.925            0.757         -0.314           0.418          -0.733           -0.820
17 TEAM~  0.0666     0.265           -0.144         -0.583          -0.447          -0.123           -0.150

TarJae
  • Awesome, thanks @TarJae! I noticed this has a good advantage over the base cor() function cor(df): it calculates despite NA values being present, and it automatically returns the result as a dataframe. – hachiko Mar 06 '21 at 06:13

Quick answer to a side question buried in this post: more efficient way to find columns with NA values in them instead of going one-by-one

moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))
hachiko