I'm trying to use a custom function called inference(), as shown in the code below. The function has no documentation, but it comes from my DASI class on Coursera, and according to the feedback I have received, I am calling it properly. I want to run a two-sided hypothesis test on the difference between the mean wordsum of two categories of my class variable, lower class and working class: that is, average wordsum for working class minus average wordsum for lower class. However, the function keeps insisting that I do an ANOVA test. That doesn't work for me, since I'm trying to test (and hopefully reject) the null hypothesis and build a confidence interval for the difference of two independent means. I've looked at the function's source, but as I'm no R expert, I don't see anything out of the ordinary. Any help is greatly appreciated.

Code:

load(url("http://bit.ly/dasi_gss_ws_cl"))
source("http://bit.ly/dasi_inference")

summary(gss)
by(gss$wordsum, gss$class, mean)
boxplot(gss$wordsum ~ gss$class)

gss_clean = na.omit(subset(gss, class == "WORKING" | class =="LOWER"))

inference(y = gss_clean$wordsum, x = gss_clean$class, est = "mean", type = "ht", 
          null = 0, alternative = "twosided", method = "theoretical")

Returns:

Response variable: numerical, Explanatory variable: categorical
Error: Use alternative = 'greater' for ANOVA or chi-square test.
In addition: Warning message:
Ignoring null value since it's undefined for ANOVA.
Jason T. Eyerly
  • Why is this being voted down? This is a legitimate question. – Jason T. Eyerly Oct 05 '14 at 01:19
  • I'm not one of the down-voters, but I can guess. When I read this, it didn't make any sense. In particular, the sentence "I'm trying to do a two-sided hypothesis test between my class variable and my wordsum variable" didn't really specify any hypothesis, at least as I understand the term. Do you mean you want to model and test an interaction term? (At the moment the description of these entities is too vague to permit further guesswork. And that source file is huge. I think you should be corresponding with the members of your class.) – IRTFM Oct 05 '14 at 01:42
  • 1.8Mb is huge? As I stated, according to my classmates I'm using it exactly as I should. – Jason T. Eyerly Oct 05 '14 at 04:26
  • The source file, not the data file. It's not our responsibility to page through your code to find the error. – IRTFM Oct 05 '14 at 05:21
  • I believe the comments are for positive contributions that help arrive at a solution. Let's keep it that way instead of worrying about responsibilities, shall we? – Jason T. Eyerly Oct 05 '14 at 17:22
  • You asked why it was being voted down, @BondedDust gave an explanation; you responded, they responded to your response. I agree that further discussion (if any) should be moved to Meta. – Ben Bolker Oct 05 '14 at 17:48

1 Answer

You need

gss_clean <- droplevels(gss_clean)

Then your inference() call works:

Response variable: numerical, Explanatory variable: categorical
Difference between two means
Summary statistics:
n_LOWER = 41, mean_LOWER = 5.0732, sd_LOWER = 2.2404
n_WORKING = 407, mean_WORKING = 5.7494, sd_WORKING = 1.8652
Observed difference between means (LOWER-WORKING) = -0.6762
H0: mu_LOWER - mu_WORKING = 0 
HA: mu_LOWER - mu_WORKING != 0 
Standard error = 0.362 
Test statistic: Z =  -1.868 
p-value =  0.0616 
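
The reported standard error, Z statistic, and p-value can be cross-checked from the summary statistics alone. The sketch below assumes that inference() uses the usual unpooled standard error and a normal reference distribution; that is a guess consistent with the printed numbers, not something confirmed from the function's source. The variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_lower   <- 41;  mean_lower   <- 5.0732; sd_lower   <- 2.2404
n_working <- 407; mean_working <- 5.7494; sd_working <- 1.8652

# Assumed formula: unpooled standard error of the difference in means
se <- sqrt(sd_lower^2 / n_lower + sd_working^2 / n_working)

# Observed difference (LOWER - WORKING) divided by its standard error
z <- (mean_lower - mean_working) / se

# Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p = p), 4)
```

The results agree with the reported values up to rounding of the summary statistics.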

The problem is that subset() keeps every level of a factor, including levels that no longer have any rows. Unless you drop the unused levels, the internal machinery of inference() thinks that you have a 4-level categorical variable, and it can't do a t-test or equivalent 2-category test: it has to do a one-way ANOVA or analogue.
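
A small illustration of that behaviour, using a toy data frame rather than the real gss data (the level names and values here are made up):

```r
# Toy stand-in for the gss data; four class levels, invented for illustration
toy <- data.frame(
  class   = factor(c("LOWER", "WORKING", "MIDDLE", "UPPER", "WORKING", "LOWER")),
  wordsum = c(4, 6, 7, 8, 5, 3)
)

# Subsetting removes rows, but the factor still remembers all four levels
toy_sub <- subset(toy, class == "WORKING" | class == "LOWER")
nlevels(toy_sub$class)   # number of levels is unchanged by subset()
table(toy_sub$class)     # the unused levels show up with zero counts

# droplevels() discards the levels that have no remaining observations
toy_clean <- droplevels(toy_sub)
levels(toy_clean$class)
```

Running table() on the grouping variable after subsetting is a quick way to spot leftover empty levels before calling a function that branches on the number of categories.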

Ben Bolker
  • Thank you so much! This was driving me insane! May I ask why that is? I thought by creating a new "clean" dataset using subset, those levels were automatically dropped? – Jason T. Eyerly Oct 05 '14 at 17:23
  • Let me rephrase that. How would I properly create a table/dataset with just the working and lower class, and not have the blanks? – Jason T. Eyerly Oct 05 '14 at 17:36
  • 2
    the answer is to use `droplevels()` after subsetting, just as I showed. I would love it if there were an option to `subset()` that did this automatically, but the R developers have never agreed with me about this: https://stat.ethz.ch/pipermail/r-devel/2010-August/058120.html – Ben Bolker Oct 05 '14 at 17:47