1

I am having difficulty figuring out how to convert some wide data into long format. I have three columns of string data (A1_R00_FillerNP, A1_R01_ADV, and A1_R02_1stEmbV) which I would like to melt into one column (WordCountRegion) in such a way that for each Subject and item the correct word will be mapped from one of these three columns to the new, WordCountRegion column.

Using a simple melt approach as in the code below gets me part of the way there:

(Note: the strange characters in the df are inconsequential - please ignore them here)

df <- structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L), condition = structure(c(2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L), .Label = c("P", "R", 
"S"), class = "factor"), item = c(101L, 102L, 103L, 101L, 102L, 
103L, 101L, 102L, 103L, 101L, 102L, 103L, 101L, 102L, 103L, 101L, 
102L, 103L), A1_R00_FillerNP = structure(c(3L, 2L, 1L, 3L, 2L, 
1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L), .Label = c("SÌÇna d_r allvarliga konsekvenser", 
"SÌÇna d_r fina _ppeltr_d", "SÌÇna d_r gamla skottk_rror"
), class = "factor"), A1_R01_ADV = structure(c(1L, 1L, 2L, 1L, 
1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("alltid", 
"f_rresten"), class = "factor"), A1_R02_1stEmbV = structure(c(3L, 
2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 
1L), .Label = c("diskuterade", "stod", "tv_ttade"), class = "factor"), 
    RT = c(0L, 149L, 247L, 272L, 171L, 245L, 317L, 0L, 233L, 
    0L, 981L, 750L, 272L, 171L, 334L, 317L, 0L, 233L), Region = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 
    3L, 3L), .Label = c("R00", "R01", "R02"), class = "factor"), 
    RegionType = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 
    1L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("1stEmbV", 
    "ADV", "FillerNP"), class = "factor"), DV = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L), .Label = c("FIRST_FIXATION_DURATION", "GAZE_DURATION"
    ), class = "factor")), .Names = c("Subject", "condition", 
"item", "A1_R00_FillerNP", "A1_R01_ADV", "A1_R02_1stEmbV", "RT", 
"Region", "RegionType", "DV"), class = "data.frame", row.names = c(NA, 
-18L))

df1 = melt(df, measure.vars = c("A1_R00_FillerNP","A1_R01_ADV","A1_R02_1stEmbV"), var = "WordCountRegion")

The problem is that this code incorrectly breaks the words across regions. I end up with output like the following, where words do not break as specified by Region and instead extend across values of Region, as can be seen by WordCountRegion and value. It is clear that if I am going to use this, then I need some sort of additional specification so that melt() will be able to break the data correctly. I'm just not sure how to do this (or if it can be done within melt()).

   Subject condition item  RT Region RegionType                      DV WordCountRegion                             value
1      101         R  101   0    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
2      101         P  102 149    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
3      101         S  103 247    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
4      101         R  101 272    R01        ADV FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
5      101         P  102 171    R01        ADV FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
6      101         S  103 245    R01        ADV FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
7      101         R  101 317    R02    1stEmbV FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
8      101         P  102   0    R02    1stEmbV FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
9      101         S  103 233    R02    1stEmbV FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
10     101         R  101   0    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
11     101         P  102 981    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
12     101         S  103 750    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
13     101         R  101 272    R01        ADV           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
14     101         P  102 171    R01        ADV           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
15     101         S  103 334    R01        ADV           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
16     101         R  101 317    R02    1stEmbV           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
17     101         P  102   0    R02    1stEmbV           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
18     101         S  103 233    R02    1stEmbV           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
19     101         R  101   0    R00   FillerNP FIRST_FIXATION_DURATION      A1_R01_ADV                            alltid
20     101         P  102 149    R00   FillerNP FIRST_FIXATION_DURATION      A1_R01_ADV                            alltid
21     101         S  103 247    R00   FillerNP FIRST_FIXATION_DURATION      A1_R01_ADV                         f_rresten

Is there a way that I could modify melt() to get these to line up/match by Region, as in the sample below:

   Subject condition item  RT Region RegionType                      DV WordCountRegion                             value
1      101         R  101   0    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
2      101         P  102 149    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
3      101         S  103 247    R00   FillerNP FIRST_FIXATION_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser
4      101         R  101 272    R01        ADV FIRST_FIXATION_DURATION A1_R01_ADV                                 alltid
5      101         P  102 171    R01        ADV FIRST_FIXATION_DURATION A1_R01_ADV                                 alltid
6      101         S  103 245    R01        ADV FIRST_FIXATION_DURATION A1_R01_ADV                              f_rresten
7      101         R  101 317    R02    1stEmbV FIRST_FIXATION_DURATION A1_R02_1stEmbV                           tv_ttade
8      101         P  102   0    R02    1stEmbV FIRST_FIXATION_DURATION A1_R02_1stEmbV                               stod
9      101         S  103 233    R02    1stEmbV FIRST_FIXATION_DURATION A1_R02_1stEmbV                        diskuterade
10     101         R  101   0    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP       SÌÇna d_r gamla skottk_rror
11     101         P  102 981    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP          SÌÇna d_r fina _ppeltr_d
12     101         S  103 750    R00   FillerNP           GAZE_DURATION A1_R00_FillerNP SÌÇna d_r allvarliga konsekvenser

Or, if I am using the wrong function altogether, could someone please point me towards a better solution? Perhaps I need something that does actual lookups?

D T
  • 99
  • 1
  • 12
  • 1
    It's hard to follow exactly what you're asking due to the wrapping, etc., in your code snippets. How should melt know how Region maps to the three columns you're stacking? Is the problem just that the result you get in df1 has Region labeled incorrectly? If so, could you just recreate region in the melted dataframe by looking for R00, R01, etc in the values of WordCountRegion? – eamcvey Dec 05 '14 at 15:22
  • Hi. I changed some formatting and edited to hopefully make the issue more clear. Perhaps it simply is not possible for me to use melt() since, as you say, it has to know where the breaks should occur. Is there a lookup function that would work better? – D T Dec 05 '14 at 15:29
  • You could just filter the dataset after melting to keep only rows when the region in `WordCountRegion` matches `Region`. If `WordCountRegion` always has the region code in elements 4-6 of the string, maybe `subset(df1, Region == substr(WordCountRegion, 4, 6))`. Alternatively something like `subset(df1, Region == gsub("^.*(R[0-9]+).*$", "\\1", WordCountRegion))`. – aosmith Dec 05 '14 at 16:41
  • @aosmith and eamcvey. Thanks very much for the suggestions. One issue is that I have drastically simplified the dataset here. There are actually additional regions that I want to keep in the data set but that I will not use for every analysis. Also I'm not entirely sure that my melt code is collapsing the data in a way that preserves subject and item id.vars information (though it seems that it should be). if it is doing this, then aosmith's first approach might work best. I must test... – D T Dec 05 '14 at 18:32

1 Answers1

1

You could create a little lookup table, merge it in, then use it to filter your melted dataframe, and I believe this gives you the result you're looking for.

region_df <- data.frame(var = c("A1_R00_FillerNP","A1_R01_ADV","A1_R02_1stEmbV"), 
  Region = c('R00','R01','R02'))

df2 <- merge(df1, region_df)
df3 <- subset(df2, var==WordCountRegion)
eamcvey
  • 645
  • 6
  • 18
  • This seems to work perfectly. And it allows for different sized var names (which is actually the case in my full dataset). I still have some hesitation regarding how precise the original (overextending) code matches the correct subset of the data (pre-filtering). But I checked two regions for two subjects in the filtered data with known to be correct values and did not see any issues at all. Thanks to everyone for helping. – D T Dec 05 '14 at 21:15