
I have a time-series panel dataset which is structured in the following way:

df <- data.frame(
  year = c(2012L, 2013L, 2014L, 2012L, 2013L, 2014L, 2015L),
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  col1 = c(11L, 13L, 13L, 16L, 15L, 15L, 16L),
  col2 = c(10L, 14L, 12L, 13L, 11L, 16L, 17L),
  col3 = c(17L, 12L, 12L, 14L, 19L, 21L, 12L)
)
> df
  year id col1 col2 col3
1 2012  1   11   10   17
2 2013  1   13   14   12
3 2014  1   13   12   12
4 2012  2   16   13   14
5 2013  2   15   11   19
6 2014  2   15   16   21
7 2015  2   16   17   12

I would like to generate a cross-sectional lower-triangle correlation table in LaTeX for each column pair, computed within each group, where the final table reports the average correlation across all groups along with the p-value. This is what I have so far using dplyr:

library(dplyr)
df %>%
  group_by(id) %>%
  summarize(COR=cor(col1,col2))

But I would like to have this for all column pairs, and my actual dataset has many more ids. I would like to use xtable, stargazer, or Hmisc to generate a LaTeX correlation table that reports the average correlation across groups and also includes the p-value. I would like my final output to look something like this: imgur.com/a/7Jwmm8f

Erwin Rhine
  • Can you show the expected output format? – akrun Dec 24 '20 at 17:38
  • Something like this picture: https://imgur.com/a/7Jwmm8f – Erwin Rhine Dec 24 '20 at 17:47
  • For the different 'id' values, do you want to create a column? – akrun Dec 24 '20 at 17:50
  • No, I would like to have the average correlation across ids as the final output. So, for example, get the correlation between col1 and col2 for id 1, then repeat for id 2, and report the average of the two as the final output for that pair. In my actual dataset I have many more ids. – Erwin Rhine Dec 24 '20 at 17:52
  • Please check the updated solution below. thanks – akrun Dec 24 '20 at 17:55

1 Answer


An option would be to split by the 'id' column, apply `cor` to the 'col' columns, sum the resulting matrices elementwise with `Reduce(`+`)`, divide by the number of unique 'id' values, and replace the `upper.tri` values with NA:

out <- Reduce(`+`, lapply(split(df[3:5], df$id),
      function(x) cor(x, use = "complete.obs")))/length(unique(df$id))
out[upper.tri(out)] <- NA

Output:

out
#           col1      col2 col3
#col1  1.0000000        NA   NA
#col2  0.5902554  1.000000   NA
#col3 -0.9807620 -0.569806    1
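The question also asks for the p-value, which the averaging above doesn't produce. A sketch of the same split/average approach applied to p-values from pairwise `cor.test` calls (the helper `pmat` is my own, not part of the answer):

```r
# Average pairwise p-values across groups, mirroring the averaging of the
# correlations above. `pmat` is a hypothetical helper, not a base R function.
df <- data.frame(
  year = c(2012L, 2013L, 2014L, 2012L, 2013L, 2014L, 2015L),
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  col1 = c(11L, 13L, 13L, 16L, 15L, 15L, 16L),
  col2 = c(10L, 14L, 12L, 13L, 11L, 16L, 17L),
  col3 = c(17L, 12L, 12L, 14L, 19L, 21L, 12L)
)

pmat <- function(x) {
  nm <- colnames(x)
  p <- matrix(NA_real_, ncol(x), ncol(x), dimnames = list(nm, nm))
  for (i in seq_along(nm)) for (j in seq_along(nm)) {
    # fill only the lower triangle with the pairwise p-value
    if (i > j) p[i, j] <- cor.test(x[[i]], x[[j]])$p.value
  }
  p
}

pout <- Reduce(`+`, lapply(split(df[3:5], df$id), pmat)) /
  length(unique(df$id))
```

Note that the arithmetic mean of p-values has no standard inferential interpretation; pooling the observations or a combining method such as Fisher's test may be more defensible, but the sketch matches the averaging the question describes.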

Or using tidyverse

library(dplyr)
library(purrr)
library(magrittr)
df %>% 
  select(-year) %>%
  group_split(id, .keep = FALSE) %>%
  map(cor, use = "complete.obs") %>% 
  reduce(`+`) %>% 
  divide_by(n_distinct(df$id)) %>% 
  replace(., upper.tri(.), NA)
#           col1      col2 col3
#col1  1.0000000        NA   NA
#col2  0.5902554  1.000000   NA
#col3 -0.9807620 -0.569806    1
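For the LaTeX output the question asked about, a minimal sketch with `xtable` (assuming the package is installed; `out` here is rebuilt from the averaged matrix shown above):

```r
library(xtable)

# Reconstruct the averaged lower-triangle matrix from the output above
nm <- c("col1", "col2", "col3")
out <- matrix(c(1, 0.5902554, -0.9807620,
                NA, 1, -0.569806,
                NA, NA, 1),
              nrow = 3, dimnames = list(nm, nm))

# print() on an xtable object emits the LaTeX tabular; NA cells print blank
print(xtable(out, digits = 3, caption = "Average correlation across ids"))
```

stargazer or Hmisc could be substituted at this last step; only the formatting of the already-averaged matrix changes.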
akrun
  • Would this still work if I have different column names? (so no longer col1, col2, … but things like car, plane, stock, etc.) – Erwin Rhine Dec 24 '20 at 18:01
  • @ErwinRhine Yes; here I am subsetting by column index instead of names in `split`, i.e. `df[3:5]`. In the second solution, the columns that are not needed are removed in `select`, i.e. `year` – akrun Dec 24 '20 at 18:03
  • Thanks a lot for your solution. For some reason I'm getting NA values for all of the output. Would this still work if there are some NA entries in my data? – Erwin Rhine Dec 24 '20 at 23:50
  • @ErwinRhine `cor` defaults to `use = "everything"`; you can specify `use = "complete.obs"` – akrun Dec 24 '20 at 23:52
  • @ErwinRhine You can try the updated function – akrun Dec 24 '20 at 23:53
  • I tried it and I get the following error: "Error in .f(.x[[i]], ...) : no complete element pairs" I also removed all the rows that have NA values and I still get either a full NA output or the error. – Erwin Rhine Dec 25 '20 at 00:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226401/discussion-between-erwin-rhine-and-akrun). – Erwin Rhine Dec 25 '20 at 00:12
  • @ErwinRhine There are many options for that `use` argument. Can you try one of those? – akrun Dec 25 '20 at 00:13
  • @ErwinRhine If you can use `dput` to show a small example that gives the error, it can be tested. – akrun Dec 25 '20 at 00:14
  • I can give you the dput, but it's super long and I'm not sure it would be helpful. And if I only use head, you would miss the different types of grouping examples. – Erwin Rhine Dec 25 '20 at 00:29
  • @ErwinRhine Can you check `colSums(is.na(df))`? It is possible that some elements of each column have NA, so that `complete.obs` removes all the observations – akrun Dec 25 '20 at 00:30
  • I just did some manual checking, and I think it's because the `cor` function returns some NA values, so when you average them without `na.rm = TRUE` you get an NA final output. As a test, I computed the correlation for one pair and took the mean across groups: I got the same NA, but when I used `na.rm = TRUE` inside `mean`, it worked. – Erwin Rhine Dec 25 '20 at 00:43
  • @ErwinRhine There is no `na.rm` argument in `cor`, so `na.rm = TRUE` wouldn't work there; it only has the `use` option. `mean`, by contrast, does have a `na.rm` argument. It would be good to know the output of `colSums(is.na(df))` – akrun Dec 25 '20 at 00:43
  • @ErwinRhine Let's say your dataset is `df1 <- data.frame(col1 = c(NA, 1, 2), col2 = c(2, NA, 3), col3 = c(3, 1, NA))`. Here each row has a single `NA`, so `cor(df1, use = "complete.obs")` results in `Error in cor(df1, use = "complete.obs") : no complete element pairs`. If you use `cor(df1, use = "na.or.complete")` it returns all NA – akrun Dec 25 '20 at 00:48
  • The output of colSums(is.na(df)) is zero for every column – Erwin Rhine Dec 25 '20 at 00:49
  • @ErwinRhine That seems strange. If there is no NA, then how do you get `Error in .f(.x[[i]], ...) : no complete element pairs`? I get the same error when every row has at least one NA in some column – akrun Dec 25 '20 at 00:50
  • I removed all the NAs and I don't get that anymore, but all my final outputs are still NA. Sorry for the confusion – Erwin Rhine Dec 25 '20 at 00:53
  • @ErwinRhine May be there is some info missing. Can you post an example data in github that gives the error and send the link – akrun Dec 25 '20 at 00:53
  • @ErwinRhine How did you remove the NAs? Did you use `complete.obs` or `na.omit`? – akrun Dec 25 '20 at 00:54
  • @ErwinRhine Suppose you do `na.omit(df1)`; it returns 0 rows in the example, as it removes rows having any NA in one of the columns – akrun Dec 25 '20 at 00:55
  • @ErwinRhine I have to leave now. Sorry! will check later – akrun Dec 25 '20 at 00:58
  • Here is the link to a sample of the data: https://github.com/Hitnake/testdata/tree/main Thanks a lot of your time in advance. – Erwin Rhine Dec 25 '20 at 01:03
  • @ErwinRhine I am getting an error while downloading `Error in readRDS(url("https://github.com/Hitnake/testdata/blob/main/testdata.rds", : unknown input format` – akrun Dec 25 '20 at 17:42
  • Hi, sorry for the trouble. I uploaded it again in RData format. Hopefully it works now. – Erwin Rhine Dec 26 '20 at 01:28
  • I realized there was some missing data creating the problem, and the code works now. Thank you very much for your time and help. – Erwin Rhine Dec 26 '20 at 01:31
  • Hi akrun, sorry to bother you again. When I try to run the code on a different dataset, which has a different number of time-series observations across ids, I only get NA values again. – Erwin Rhine Dec 17 '21 at 23:53