I am trying to move from long format data to wide format in order to do some correlation analyses.

However, dcast seems to create two rows for the first subject and splits the data across those two rows, filling the resulting empty cells with NA.

The first 2 subjects were being duplicated when I was using alphanumeric subject codes. I switched to numeric subject numbers, and that has reduced the problem to only the first subject being duplicated.

The first few lines of the long-format data frame:

       Subject Age Gender R_PTA L_PTA BE_PTA Avg_PTA L_Aided_SII R_Aided_SII Best_Aided_SII L_Unaided_SII R_Unaided_SII Best_Unaided_SII L_SII_Diff R_SII_Diff
1       1  74      M 48.33 53.33  48.33   50.83          31          42             42            14            25               25         17         17
2       2  77      F 36.67 36.67  36.67   36.67          73          67             73            44            43               44         29         24
3       3  72      F 45.00 41.67  41.67   43.33          42          34             42            35            28               35          7          6
4       4  66      F 36.67 36.67  36.67   36.67          66          76             76            44            44               44         22         32
5       5  38      F 41.67 46.67  41.67   44.17          48          58             58            23            29               29         25         29
6       6  65      M 35.00 43.33  35.00   39.17          46          60             60            32            46               46         14         14
  Best_SII_Diff       rSII MoCA_Vis MoCA_Nam MoCA_Attn MoCA_Lang MoCA_Abst MoCA_Del_Rec MoCA_Ori MoCA_Tot   PNT Semantic   Aided PNT_Prop PNT_Prop_Mod
1            17 -0.4231157        5        3         6         2         2            2        6       26 0.971    0.029 Unaided    0.971        0.983
2            29  1.2739255        3        3         5         0         2            2        5       20 0.954    0.046 Unaided    0.960        0.966
3             7 -1.2777889        4        2         5         2         2            5        6       26 0.966    0.034 Unaided    0.960        0.982
4            32  1.5959701        5        3         6         3         2            5        6       30 0.983    0.017 Unaided    0.983        0.994
5            29  0.9492167        4        2         6         3         1            3        6       25 0.983    0.017 Unaided    0.983        0.994
6            14 -0.2936395        4        2         6         2         2            2        6       24 0.989    0.011 Unaided    0.989        0.994
  PNT_S_Wt PNT_P_Wt
1    0.046    0.041
2    0.073    0.033
3    0.045    0.074
4    0.049    0.057
5    0.049    0.057
6    0.049    0.057

Creating varlist:

varlist <- list(colnames(subset(PNT_Data_All2, ,c(18:27,29:33))))

My dcast command:

Data_Wide <- dcast(as.data.table(PNT_Data_All2),Subject + Age + Gender + R_PTA + L_PTA + BE_PTA + Avg_PTA + L_Aided_SII + R_Aided_SII + Best_Aided_SII + L_Unaided_SII + R_Unaided_SII + Best_Unaided_SII + L_SII_Diff + R_SII_Diff + Best_SII_Diff + rSII ~ Aided, value.var=varlist)

The resulting first few lines of the wide format:

  Subject Age Gender R_PTA L_PTA BE_PTA Avg_PTA L_Aided_SII R_Aided_SII Best_Aided_SII L_Unaided_SII R_Unaided_SII Best_Unaided_SII L_SII_Diff R_SII_Diff
1:       1  74      M 48.33 53.33  48.33   50.83          31          42             42            14            25               25         17         17
2:       1  74      M 48.33 53.33  48.33   50.83          31          42             42            14            25               25         17         17
3:       2  77      F 36.67 36.67  36.67   36.67          73          67             73            44            43               44         29         24
4:       3  72      F 45.00 41.67  41.67   43.33          42          34             42            35            28               35          7          6
5:       4  66      F 36.67 36.67  36.67   36.67          66          76             76            44            44               44         22         32
6:       5  38      F 41.67 46.67  41.67   44.17          48          58             58            23            29               29         25         29

Notice that Subject 1 has 2 entries. All of the other subjects seem correct.

Is this a problem with my command/arguments? A bug in dcast?

Edit 1: By process of elimination, I found that the extra entries only appear when I include the "rSII" variable. This variable is calculated in a previous step of the script:

PNT_Data_All$rSII <- stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data=PNT_Data_All))

PNT_Data_All <- PNT_Data_All[, colnames(PNT_Data_All)[c(1:17,34,18:33)]]

Is there something about that calculated variable that would mess up dcast for some subjects?
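One way to test this suspicion is to count how many distinct rSII values each subject has. The sketch below uses a made-up toy data frame (the values and the name `toy` are hypothetical, not the real data) and base R only. It shows that two doubles can print identically yet still compare as different, which would place them in separate groups wherever the column is used as a grouping key. Whether this is what happens in PNT_Data_All2 is an assumption that would need checking on the real data.

```r
# Toy long-format data: two rows per subject.
# All values are hypothetical stand-ins, not the real PNT data.
toy <- data.frame(
  Subject = c(1, 1, 2, 2),
  Aided   = c("Unaided", "Aided", "Unaided", "Aided"),
  rSII    = c(0.1 + 0.2, 0.3, 1.25, 1.25)
)

# Subject 1's two values look the same when printed at default precision...
print(toy$rSII[1:2])

# ...but they are not exactly equal as doubles.
print(toy$rSII[1] == toy$rSII[2])

# Number of distinct rSII values per subject; anything above 1
# means the column does not identify the subject uniquely.
n_distinct <- tapply(toy$rSII, toy$Subject, function(v) length(unique(v)))
print(n_distinct)

# Size of the within-subject discrepancy.
max_gap <- tapply(toy$rSII, toy$Subject, function(v) diff(range(v)))
print(max_gap)
```

Running the same two `tapply()` calls on the real data frame (with `PNT_Data_All2$rSII` and `PNT_Data_All2$Subject`) would show whether any subject has more than one distinct value.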

Edit 2 to add my workaround:

I ended up rounding the calculated variable to 3 digits after the decimal and that solved the problem. Everything is casting correctly now with no duplicates.

PNT_Data_All$rSII <- format(round(stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data=PNT_Data_All)),3),nsmall=3)
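One caveat with this workaround: `format()` returns character strings, so rSII is no longer numeric, which may matter for the later correlation analyses. A minimal sketch of the difference, using hypothetical stand-in values rather than the real residuals, and assuming that rounding alone is enough to make the values within a subject identical:

```r
# Hypothetical stand-ins for two rSII values that should be equal
# but differ in the last bits.
x <- c(0.1 + 0.2, 0.3)

# The workaround above: round, then format. The result is character.
as_text <- format(round(x, 3), nsmall = 3)
print(class(as_text))

# Rounding alone keeps the column numeric.
as_num <- round(x, 3)
print(class(as_num))

# After rounding, the two values collapse to a single distinct value.
print(length(unique(as_num)))
```

Rounding can still split values that sit right on a rounding boundary, so it is worth re-checking the number of distinct values per subject afterwards.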
JLC
  • I've found that best practice with dcast is to only use those variables I plan on casting. Otherwise you may get dupes. Subset the columns you are going to use first THEN pass that as data to dcast. – Brandon Bertelsen Feb 01 '17 at 05:11
  • @Brandon I did subset the data first and was attempting to cast all of the variables in the data frame. But it was still messing up. I added my workaround to the original question/post. It seems the extended digits after the decimal point in the calculated variable were causing the problem. – JLC Feb 01 '17 at 05:30
  • I'm trying to reproduce your problem but I can't find a definition of `varlist` used in the call to `dcast()`. – Uwe Apr 18 '17 at 08:04
  • @UweBlock , I've added in a call to create varlist in the original post. – JLC Apr 18 '17 at 13:32
