
I've converted a data frame into wide format and now want to compute paired t-tests to obtain p-values. I have managed to do this for each pair of columns individually, but it's a lot more code than I feel is necessary. I'm still very new to R, data and coding generally, and couldn't easily see a solution here on Stack Overflow.

My wide data frame is:

> head(df_wide)
# A tibble: 6 x 21
Assessor Appearance1 Appearance2 Aroma1 Aroma2 Flavour1 Flavour2
   <dbl>       <dbl>       <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
       1          10          10     10     10       10       10
       2           6           7      7      5        8        4

# ... with 14 more variables

I want to perform a paired t-test on each attribute, i.e. Appearance1 vs. Appearance2, Aroma1 vs. Aroma2, etc. The 14 other variables are all <dbl> and are also attributes to be tested as paired columns.

Ideally, the output would be a vector containing only the p-values rather than the full test output. I've managed that for individual pairs, but I'd like to know whether it can be done while running the t-test over multiple pairs of columns.

Here is the code I have for the first two attributes:

p_values <- c(t.test(df_wide$`Appearance1`, df_wide$`Appearance2`, paired = TRUE)[["p.value"]],
              t.test(df_wide$`Aroma1`, df_wide$`Aroma2`, paired = TRUE)[["p.value"]])

This creates the vector I want, but is cumbersome and error-prone. Ideally, I'd be able to perform it over all the pairs at once without needing to use column names.
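
One way to avoid typing column names is to pair the columns by position. The following is only a sketch under stated assumptions: `df_wide_demo` is a small hypothetical stand-in for `df_wide` (built from the first four assessors of the sample data), and it assumes that the first column is `Assessor` and that every attribute occupies two adjacent columns.

```r
# Hypothetical stand-in for df_wide: Assessor first, then adjacent column pairs.
df_wide_demo <- data.frame(
  Assessor    = 1:4,
  Appearance1 = c(10, 6, 6, 7),
  Appearance2 = c(10, 7, 9, 8),
  Aroma1      = c(10, 7, 8, 6),
  Aroma2      = c(10, 5, 9, 7)
)

# Positions of the first column of each pair (2, 4, ...) and of its partner.
first  <- seq(2, ncol(df_wide_demo), by = 2)
second <- first + 1

# One paired t-test per pair of columns; only the p-value is kept.
p_values <- mapply(
  function(i, j) t.test(df_wide_demo[[i]], df_wide_demo[[j]], paired = TRUE)$p.value,
  first, second
)
p_values
```

Because `first` and `second` are plain unnamed numeric vectors, `mapply()` returns an unnamed numeric vector with one p-value per pair, in left-to-right column order.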

I do have the original data frame in long format, if it would be easier to do it using that (EDIT: used dput() for the first 20 rows instead of head()):

> dput(df_test[1:20,])
structure(list(Assessor = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
Product = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC", "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
Aroma = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
Flavour = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8),
Texture = c(10, 10, 8, 8, 9, 6, 7, 8, 8, 8, 9, 10, 8, 8, 9, 8, 8, 9, 9, 8),
`JAR Colour` = c(3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3),
`JAR Strength Chocolate` = c(2, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2),
`JAR Strength Vanilla` = c(3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3),
`JAR Sweetness` = c(2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3),
`JAR Creaminess` = c(3, 3, 3, 3, 3, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3),
`Overall Acceptance` = c(9, 10, 8, 4, 10, 5, 7, 7, 8, 8, 9, 10, 8, 8, 8, 8, 8, 9, 8, 8)),
row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))

The Product variable is the one which was used to make the paired columns in the wide format data frame. Thanks in advance.

Dee G

1 Answer


If I understand correctly, you can run the tests directly on the long-format data, one attribute column at a time:

df <- structure(list(Assessor = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
               Product = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC", "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
               Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
               Aroma = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
               Flavour = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8),
               Texture = c(10, 10, 8, 8, 9, 6, 7, 8, 8, 8, 9, 10, 8, 8, 9, 8, 8, 9, 9, 8),
               `JAR Colour` = c(3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3),
               `JAR Strength Chocolate` = c(2, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2),
               `JAR Strength Vanilla` = c(3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3),
               `JAR Sweetness` = c(2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3),
               `JAR Creaminess` = c(3, 3, 3, 3, 3, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3),
               `Overall Acceptance` = c(9, 10, 8, 4, 10, 5, 7, 7, 8, 8, 9, 10, 8, 8, 8, 8, 8, 9, 8, 8)),
          row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))

head(df)
#> # A tibble: 6 x 12
#>   Assessor Product Appearance Aroma Flavour Texture `JAR Colour`
#>      <dbl> <chr>        <dbl> <dbl>   <dbl>   <dbl>        <dbl>
#> 1        1 MC              10    10      10      10            3
#> 2        1 MV              10    10      10      10            2
#> 3        2 MC               6     7       8       8            2
#> 4        2 MV               7     5       4       8            3
#> 5        3 MV               9     9      10       9            3
#> 6        3 MC               6     8       7       6            3
#> # ... with 5 more variables: JAR Strength Chocolate <dbl>,
#> #   JAR Strength Vanilla <dbl>, JAR Sweetness <dbl>, JAR Creaminess <dbl>,
#> #   Overall Acceptance <dbl>

library(tidyverse)
map_df(df[-c(1:2)], ~t.test(.x ~ df$Product, paired = TRUE)$p.value)
#> # A tibble: 1 x 10
#>   Appearance Aroma Flavour Texture `JAR Colour` `JAR Strength Chocolate`
#>        <dbl> <dbl>   <dbl>   <dbl>        <dbl>                    <dbl>
#> 1      0.496 0.576       1   0.309        0.678                        1
#> # ... with 4 more variables: JAR Strength Vanilla <dbl>, JAR Sweetness <dbl>,
#> #   JAR Creaminess <dbl>, Overall Acceptance <dbl>

sapply(df[-c(1:2)], function(x) t.test(x ~ df$Product, paired = TRUE)$p.value)
#>             Appearance                  Aroma                Flavour 
#>              0.4961016              0.5763122              1.0000000 
#>                Texture             JAR Colour JAR Strength Chocolate 
#>              0.3092332              0.6783097              1.0000000 
#>   JAR Strength Vanilla          JAR Sweetness         JAR Creaminess 
#>              0.6783097              1.0000000              0.4433319 
#>     Overall Acceptance 
#>              0.7803523

Created on 2021-06-22 by the reprex package (v2.0.0)
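
If only the bare p-values are wanted, the names can be dropped with `unname()`. The sketch below also avoids two assumptions that the formula calls above rely on: that the rows of each product are already in the same assessor order, and that the formula method accepts `paired = TRUE` (newer R releases reportedly reject it, so treat that as something to check against your R version). `df_long` is a hypothetical cut-down copy of the sample data with three of the ten attributes; `mc`, `mv` and `attr_cols` are illustrative names only.

```r
# Cut-down copy of the sample data (three attributes only, for brevity).
df_long <- data.frame(
  Assessor   = rep(1:10, each = 2),
  Product    = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC",
                 "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
  Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
  Aroma      = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
  Flavour    = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8)
)

# Sort by assessor so both products list the assessors in the same order,
# then split the rows by product.
df_long <- df_long[order(df_long$Assessor, df_long$Product), ]
mc <- df_long[df_long$Product == "MC", ]
mv <- df_long[df_long$Product == "MV", ]

# Stop early if any assessor is missing one of the two products.
stopifnot(identical(mc$Assessor, mv$Assessor))

# Every column except the identifiers is treated as an attribute.
attr_cols <- setdiff(names(df_long), c("Assessor", "Product"))

# Default (two-vector) method of t.test, one call per attribute.
p_values <- unname(sapply(
  attr_cols,
  function(a) t.test(mc[[a]], mv[[a]], paired = TRUE)$p.value
))
p_values
```

The result is an unnamed numeric vector in the left-to-right order of the attribute columns, so it can be passed to `cbind()` alongside a table whose rows follow the same order.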

Yuriy Saraykin
  • `Error in complete.cases(x, y) : not all arguments have the same length`. I don't know if this is because my long format data frame has far more rows than those I included in `head(df)`? To clarify, I want the output as a vector so I can then `cbind` it afterwards to a table which has all the attributes as the row headings. – Dee G Jun 22 '21 at 15:06
  • show an extended data example using `dput(df_test)` – Yuriy Saraykin Jun 22 '21 at 15:24
  • It's 208 rows (104 assessors), so it's too big for a comment. I could put it as an edit in the original question, but it would be very long still! – Dee G Jun 22 '21 at 16:12
  • you don't need to show all the rows. 20-30 lines are enough – Yuriy Saraykin Jun 22 '21 at 16:17
  • I've edited the original question to include the dput() output for the first 20 rows – Dee G Jun 22 '21 at 16:55
  • The error is probably due to gaps in the data. Try running `table(df$Product)`: the numbers of MC and MV rows must be equal. – Yuriy Saraykin Jun 22 '21 at 17:07
  • I realise my error was because I hadn't put the correct data frame identifiers. I was referring to one with missing values by accident. Is there a way to amend your solution such that I have a vector of just the p-values? The code I used does this, but requires me to refer to all the pairs of columns. I included my solution in the question, but only included the first two attribute pairs to save everyone time – Dee G Jun 23 '21 at 08:57
  • Does the code in the answer, `sapply(df[-c(1:2)], function(x) t.test(x ~ df$Product, paired = TRUE)$p.value)`, help solve the problem? It returns the results as a vector. – Yuriy Saraykin Jun 23 '21 at 09:10
  • oh, sorry, I missed that part. The only thing I need in the vector is literally the p-values themselves in the order of attributes (left to right in the data frame `df_test`), i.e. `> p_values [1] 0.275412629 0.244083447 0.044558648 0.052750753 0.007050864 0.056363225 0.879656878 0.006063213 0.495361458 0.007337490` This is what my code returns as an output, but as you can see from my code in the question, it involves coding a separate t-test on each pair. – Dee G Jun 23 '21 at 09:19
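
The check suggested in the comments (equal numbers of MC and MV rows) can be taken one step further by finding which assessors lack a complete pair. This is a minimal sketch on made-up data: `df_gaps`, `counts`, `incomplete` and `df_paired` are hypothetical names, and assessor 3 is deliberately missing the MV row.

```r
# Made-up long-format data in which assessor 3 rated only one product.
df_gaps <- data.frame(
  Assessor   = c(1, 1, 2, 2, 3),
  Product    = c("MC", "MV", "MC", "MV", "MC"),
  Appearance = c(10, 10, 6, 7, 9)
)

# Overall counts per product; unequal counts mean some pairs are incomplete.
table(df_gaps$Product)

# Rows per assessor and product; a complete pair has exactly one row in each cell.
counts <- table(df_gaps$Assessor, df_gaps$Product)
incomplete <- rownames(counts)[rowSums(counts == 1) != ncol(counts)]
incomplete

# Keep only assessors with a complete pair before running paired tests.
df_paired <- df_gaps[!as.character(df_gaps$Assessor) %in% incomplete, ]
```

Missing values inside an attribute column are a separate issue from missing rows; this sketch does not handle them.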