I have a dataframe in R that looks something like this:
library(tibble)
sample <- tribble(~subj, ~session,
"A", 1,
"A", 2,
"A", 3,
"B", 1,
"B", 2,
"C", 1,
"C", 2,
"C", 3,
"C", 4)
As you can see from this example, there are a number of sessions for each subject, but subjects do not all have the same number of sessions. There are 94 rows in my real dataset (5 subjects, between 15 and 20 different sessions each).
I have another script that takes my main dataset (a set of linguistic data with detailed phonetic features for each subject in each session, with almost 200,000 rows) and filters by subject and session to create a distance matrix showing Euclidean distances between the different words. I can't replicate it here for practical reasons, but have created an example script here:
library(tibble)
library(dplyr) # needed below for %>%, filter, select, group_by, summarize
data <- tribble(~subj, ~session, ~Target, ~S1C1_target, # S1C1 = syllable 1, consonant 1
~S1C1_T.Sonorant, ~S1C1_T.Consonantal, # _T. = target consonant of S1C1
~S1C1_T.Voice, ~S1C1_T.Nasal, ~S1C1_T.Degree, # .Voice/.Nasal/etc are phonetic
# properties of the target word
"A", 1, "electricity", "i", 0, 0, 0, 0, 0,
"A", 1, "hectic", "h", 0.8, 0, 1, 0, 0,
"A", 1, "pillow", "p", -1, 1, -1, 0, 0,
"A", 2, "hello", "h", -0.5, 1, 0, -1, 0,
"A", 2, "cup", "k", 0.8, 0, 1, 0, 0,
"A", 2, "exam", "e", 0, 0, 0, 0, 0,
"B", 1, "wug", "w", 0.8, 0, 1, 0, 0,
"B", 1, "wug", "w", 0.8, 0, 1, 0, 0,
"B", 1, "hug", "h", 0.8, 0, 1, 0, 0,
"B", 2, "wug", "w", -0.5, 1, 0, -1, 0,
"B", 2, "well", "w", 0.8, 0, 1, 0, 0,
"B", 2, "what", "w", 0.8, 0, 1, 0, 0)
I want to start by creating a subset of the data for each subject in each session. Sometimes a participant has more than one token of the same word in `Target`, so I also average over repeated tokens here:
matrixA1 <- data %>% # name the data after the subj and session name/number
filter(subj == "A" & session == 1) %>%
dplyr::select(-subj, -session) %>% # leave only the numeric values + `Target`
group_by(Target) %>%
summarize(across(everything(), mean)) # average across targets with more than one token
##### Calculate the (squared) Euclidean distance between each phonetic property of each S1C1 target consonant
ones <- rep(1, nrow(matrixA1)) # column of ones, used to build pairwise difference matrices via outer products
Son.mat.S1C1_T <- matrixA1$S1C1_T.Sonorant %*% t(ones) - ones %*% t(matrixA1$S1C1_T.Sonorant)
rownames(Son.mat.S1C1_T) <- matrixA1$Target
colnames(Son.mat.S1C1_T) <- matrixA1$Target
colnames(Son.mat.S1C1_T) <- paste(colnames(Son.mat.S1C1_T), "Son.S1C1_T", sep = "_")
Son.mat.S1C1_T <- Son.mat.S1C1_T^2
Con.mat.S1C1_T <- matrixA1$S1C1_T.Consonantal %*% t(ones) - ones %*% t(matrixA1$S1C1_T.Consonantal)
rownames(Con.mat.S1C1_T) <- matrixA1$Target
colnames(Con.mat.S1C1_T) <- matrixA1$Target
colnames(Con.mat.S1C1_T) <- paste(colnames(Con.mat.S1C1_T), "Con.S1C1_T", sep = "_")
Con.mat.S1C1_T <- Con.mat.S1C1_T^2
Voice.mat.S1C1_T <- matrixA1$S1C1_T.Voice %*% t(ones) - ones %*% t(matrixA1$S1C1_T.Voice)
rownames(Voice.mat.S1C1_T) <- matrixA1$Target
colnames(Voice.mat.S1C1_T) <- matrixA1$Target
colnames(Voice.mat.S1C1_T) <- paste(colnames(Voice.mat.S1C1_T), "Voice.S1C1_T", sep = "_")
Voice.mat.S1C1_T <- Voice.mat.S1C1_T^2
Nasal.mat.S1C1_T <- matrixA1$S1C1_T.Nasal %*% t(ones) - ones %*% t(matrixA1$S1C1_T.Nasal)
rownames(Nasal.mat.S1C1_T) <- matrixA1$Target
colnames(Nasal.mat.S1C1_T) <- matrixA1$Target
colnames(Nasal.mat.S1C1_T) <- paste(colnames(Nasal.mat.S1C1_T), "Nasal.S1C1_T", sep = "_")
Nasal.mat.S1C1_T <- Nasal.mat.S1C1_T^2 # square this one too, matching the other features
S1C1.1A <- Son.mat.S1C1_T +
Con.mat.S1C1_T +
Voice.mat.S1C1_T +
Nasal.mat.S1C1_T
colnames(S1C1.1A) <- gsub("_Son.S1C1_T", "", colnames(S1C1.1A), fixed = TRUE)
This creates a matrix that looks something like this:
electricity hectic pillow
electricity 0.00 1.64 3.00
hectic 1.64 0.00 8.24
pillow 3.00 8.24 0.00
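For reference, I believe the same matrix can be reproduced much more compactly with base R's `dist()`, by squaring its Euclidean output over the numeric feature columns. A minimal sketch using the subject-A, session-1 values from above (rebuilt here in base R so it runs on its own):

```r
# Stand-in for matrixA1 above (subject A, session 1), base R only
matrixA1 <- data.frame(
  Target             = c("electricity", "hectic", "pillow"),
  S1C1_T.Sonorant    = c(0, 0.8, -1),
  S1C1_T.Consonantal = c(0, 0, 1),
  S1C1_T.Voice       = c(0, 1, -1),
  S1C1_T.Nasal       = c(0, 0, 0)
)

# dist() computes Euclidean distances across the numeric feature columns;
# squaring reproduces the per-feature sums of squared differences above
feat <- as.matrix(matrixA1[, -1])
rownames(feat) <- matrixA1$Target
sq <- as.matrix(dist(feat, method = "euclidean"))^2
round(sq, 2)
#             electricity hectic pillow
# electricity        0.00   1.64   3.00
# hectic             1.64   0.00   8.24
# pillow             3.00   8.24   0.00
```

This avoids building one matrix per feature by hand, though it assumes all and only the numeric columns should enter the distance.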
As you can see, this code is already quite long, and the real code is a great deal longer. I know that a loop of some kind will be the best way to deal with it, but I can't figure out how to write it. What I would like it to do is this:

- For each row in `sample`, create a dataframe named with `subj` and `session` as identifiers.
- For each of these dataframes, run the script above from `#####` onward, to create a matrix for each subject and each session, like the one shown above.

To do this, I think the best way is to embed the script in a for-loop and run it once for each row in `sample`.
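In case it helps clarify what I'm after, here is a rough sketch of the kind of thing I have in mind. The function name `build_matrix` and the toy columns are made up, and the distance step is sketched with `dist()` rather than my full per-feature script; the idea is just to split by subject and session, average repeated tokens, and collect one matrix per combination:

```r
# Hypothetical wrapper for everything from the ##### line onward: takes the
# summarized per-subject/per-session data and returns the squared-distance matrix
build_matrix <- function(m) {
  feat <- as.matrix(m[, sapply(m, is.numeric), drop = FALSE])
  rownames(feat) <- m$Target
  as.matrix(dist(feat))^2
}

# Toy stand-in for `data` above, base R only (column names are illustrative)
data <- data.frame(
  subj            = c("A", "A", "A", "B", "B", "B"),
  session         = c(1, 1, 1, 2, 2, 2),
  Target          = c("electricity", "hectic", "pillow", "wug", "wug", "well"),
  S1C1_T.Sonorant = c(0, 0.8, -1, 0.8, 0.8, 0.8),
  S1C1_T.Voice    = c(0, 1, -1, 1, 0, 1)
)

# One dataframe per subject-session combination ("A1", "B2", ...);
# average repeated tokens of the same Target, then build each matrix
pieces <- split(data, paste0(data$subj, data$session))
matrices <- lapply(pieces, function(d) {
  summarized <- aggregate(d[, -(1:3)], by = list(Target = d$Target), FUN = mean)
  build_matrix(summarized)
})
matrices[["A1"]] # matrix for subject A, session 1
#             electricity hectic pillow
# electricity        0.00   1.64   2.00
# hectic             1.64   0.00   7.24
# pillow             2.00   7.24   0.00
```

Storing the results in a named list (`matrices[["A1"]]`, `matrices[["B2"]]`, ...) rather than creating loose variables like `matrixA1` with `assign()` seems like it would be easier to work with downstream, but I'm not sure how to adapt my full script into this shape.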