I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.

Here is a look at the data ...

   sex     age  chol    tg    ht    wt   sbp   dbp  vldl   hdl   ldl   bmi
   <chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
 1 M        60   137    50  68.2  112.   110    70    10    53    74  2.40
 2 M        26   154   202  82.8  185.    88    64    34    31    92  2.70
 3 M        33   198   108  64.2  147    120    80    22    34   132  3.56
 4 F        27   154    47  63.2  129    110    76     9    57    88  3.22
 5 M        36   212    79  67.5  176.   130   100    16    37   159  3.87
 6 F        31   197    90  64.5  121    122    78    18    58   111  2.91
 7 M        28   178   163  66.5  167    118    68    19    30   135  3.78
 8 F        28   146    60  63    105.   120    80    12    46    88  2.64
 9 F        25   231   165  64    126    130    72    23    70   137  3.08
10 M        22   163    30  68.8  173    112    70     6    50   107  3.66
# … with 182 more rows

I must write a function, myTtest, to perform the following task:

  1. Perform two-sample t-tests that compare each of a series of numeric variables between the levels of a classification variable

  2. The first argument, dat, is a data frame

  3. The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as "sex".

  4. The third argument, numVar, is a character vector that contains the names of the numeric variables, such as c("age", "chol", "tg"). In that case I need to perform three t-tests, one per variable, each comparing males and females.

  5. The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).

I should be able to run this ...

myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))

... and then get the data frame to appear.

Any help is greatly appreciated. I am pulling my hair out over this one! Also, as noted in my comment below, this has to be done without the tidyverse, which is why I'm having so much trouble to begin with.

fiverings84

1 Answer

The intuition for this solution is that you can loop over your dependent variables (DVs) and call t.test() once per iteration. Then save the results for each DV and stack them together in one data frame.

I'll leave out some bits for you to fill in, but here's the gist:

First, some example data:

set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))

head(df, 10)  # first 10 of the 20 rows
   grp        age        chol          tg
1    m  1.2240818  0.42646422  0.25331851
2    m  0.3598138 -0.29507148 -0.02854676
3    m  0.4007715  0.89512566 -0.04287046
4    f  0.1106827  0.87813349  1.36860228
5    m -0.5558411  0.82158108 -0.22577099
6    f  1.7869131  0.68864025  1.51647060
7    f  0.4978505  0.55391765 -1.54875280
8    f -1.9666172 -0.06191171  0.58461375
9    m  0.7013559 -0.30596266  0.12385424
10   m -0.4727914 -0.38047100  0.21594157

Now make a container that each of the model outputs will go into:

fits_df <- data.frame()

Loop over each DV and append the model output to fits_df each time with rbind:

for (dv in c("age", "chol", "tg")) {
  frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
  fit <- t.test(frml, alternative = "two.sided", data = df) # perform the (two-sided) t-test

  # hint: use str(fit) to figure out how to pull out each value you care about
  fit_df <- data.frame(
    dv = dv,
    f_mean = xxx,
    m_mean = xxx,
    t = xxx,
    df = xxx,
    p = xxx
  )
  fits_df <- rbind(fits_df, fit_df)
}
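If it helps to see where the values live, here is a small sketch for a single variable. It assumes a toy data frame with a fixed group layout (not the data above), and it relies on the components that t.test() returns in its "htest" object: estimate, statistic, parameter, and p.value. Treat it as an illustration rather than the finished answer.

```r
# Toy data (an assumption for illustration): 10 "f" rows, then 10 "m" rows
set.seed(123)
df <- data.frame(
  grp = rep(c("f", "m"), each = 10),
  age = rnorm(20)
)

fit <- t.test(age ~ grp, data = df)  # Welch two-sample t-test (the default)

# Components of the "htest" object returned by t.test()
fit$estimate   # named vector: group means, in sorted level order ("f", then "m")
fit$statistic  # t statistic
fit$parameter  # degrees of freedom
fit$p.value    # p-value

# as.numeric() drops the names, so the data frame keeps plain row names
one_row <- data.frame(
  dv     = "age",
  f_mean = as.numeric(fit$estimate[1]),
  m_mean = as.numeric(fit$estimate[2]),
  t      = as.numeric(fit$statistic),
  df     = as.numeric(fit$parameter),
  p      = fit$p.value
)
one_row
```

The group means come back in the order of the sorted group labels, which is why "f" is taken from position 1 and "m" from position 2.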

Your output will look like this:

fits_df
    dv      f_mean      m_mean      t     df         p
1  age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol  0.07731514  0.22158672 -0.375 17.828 0.7119400
3   tg  0.09349567  0.23693052 -0.345 14.284 0.7352112

One note: when you pull values out of fit, you may get odd row names in your output data frame. This happens because several components of fit (such as fit$statistic and fit$estimate) are named vectors, and data.frame() turns those names into row names. You can get rid of them by wrapping the values in as.numeric() or as.character() (for example, fit$statistic can be cleaned up with as.numeric(round(fit$statistic, 3)), which also keeps the column numeric).
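To tie this back to the function in the question, here is one possible base-R sketch of myTtest() with no tidyverse dependency. The argument names and output columns come from the question; the toy data frame, the two-level check, and the way the mean columns are named from the sorted group labels are my own assumptions, so adapt as needed.

```r
# Hypothetical sketch of myTtest(); assumes classVar has exactly two levels
myTtest <- function(dat, classVar, numVar) {
  lv <- levels(factor(dat[[classVar]]))  # sorted labels, e.g. "F", "M"
  stopifnot(length(lv) == 2)

  rows <- lapply(numVar, function(v) {
    frml <- as.formula(paste(v, "~", classVar))  # e.g. age ~ sex
    fit  <- t.test(frml, data = dat)             # Welch t-test by default
    out  <- data.frame(
      Varname = v,
      mean1   = unname(fit$estimate[1]),  # mean of first level
      mean2   = unname(fit$estimate[2]),  # mean of second level
      t       = unname(fit$statistic),
      df      = unname(fit$parameter),
      p       = fit$p.value
    )
    names(out)[2:3] <- paste0(lv, ".mean")  # e.g. F.mean, M.mean
    out
  })
  do.call(rbind, rows)  # stack one row per variable
}

# Toy data for illustration only (not the data from the question)
set.seed(1)
toy <- data.frame(
  sex  = rep(c("M", "F"), each = 10),
  age  = rnorm(20, mean = 30, sd = 5),
  chol = rnorm(20, mean = 180, sd = 30),
  tg   = rnorm(20, mean = 100, sd = 40)
)
res <- myTtest(dat = toy, classVar = "sex", numVar = c("age", "chol", "tg"))
res
```

Because t.test() defaults to var.equal = FALSE, this runs Welch's test, which is why the degrees of freedom are generally not whole numbers. If your assignment expects the pooled-variance version, pass var.equal = TRUE to t.test().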

andrew_reece