I have a dataframe like this

df <- structure(list(ID = c(243, 292, 317, 388, 398, 404, 463, 473, 
842, 844, 858, 862, 869, 871, 879, 888), Zone = c(1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), Gen = c("Male", "Male", 
"Other Gender Identity", "Male", "Male", "Male", "Male", "Female", 
"Female", "Male", "Female", "Male", "Male", "Male", "Male", "Female"
), Month_Inc = c("< $1,500", "< $1,500", "< $1,500", "$1,500 - $1,999", 
"$1,500 - $1,999", "< $1,500", "< $1,500", "< $1,500", "$1,500 - $1,999", 
"$2,000 - $2,499", "$1,500 - $1,999", "< $1,500", "$2,500 - $2,999", 
"< $1,500", "< $1,500", "< $1,500")), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))

What I need to do is to test if there is a statistical difference for the percentage of females in the two zones. I need to test this for the income level too.

I need to do a t-test for Gen ~ Zone, with H0: %female = %male for the two zones, and H1: %female != %male for the two zones.

Similarly, for the Month_Inc ~ Zone too!

I tried the following code

t.test(Gen ~ Zone, mu = 0, alt = "two.sided",
       conf=  0.95, paired = FALSE, ver.equal = FALSE, 
       data= df)

however, I am not getting anywhere! How do I correct it? I am thinking of something to do with the data type issue but I can't be certain.

Thanks for your help!

neilfws
Tathagato
  • What do you mean by "not getting anywhere" ? Unexpected results, an error message? Note that a t-test tests for a difference between the means of groups, and you are supplying `Gen` (a character) where a numeric variable is expected. – neilfws Oct 18 '22 at 03:47
  • Yes, I am getting an error message as follows: `Warning: is.na() applied to non-(list or vector) of type 'language'` `Warning: argument is not numeric or logical: returning NA` `Error in var(x) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.` – Tathagato Oct 18 '22 at 03:50
  • I changed the data type and now getting: Error in var(y) : is.atomic(x) is not TRUE – Tathagato Oct 18 '22 at 04:02

1 Answer

There is a statistical issue here that you're overlooking. You're investigating a difference in the proportion of females between two areas, which is a comparison of proportions rather than of means. I would consider Fisher's exact test (fisher.test() in R), which is a convenient non-parametric test when the sample sizes are not very large. The prop.test() function, which performs a chi-squared test for equality of proportions, should also work well. First, we feed prop.test() a vector of successes, which is just the count of females within each zone. The next argument is a vector of sample sizes.

# Let's calculate the counts for the different zone-gender pairs
library(dplyr)  # provides group_by(), summarize(), and n()

df |>
  group_by(Zone, Gen) |>
  summarize(Total = n())

# A tibble: 5 × 3
# Groups:   Zone [2]
   Zone Gen                   Total
  <dbl> <chr>                 <int>
1     1 Female                    1
2     1 Male                      6
3     1 Other Gender Identity     1
4     2 Female                    3
5     2 Male                      5

Since I'm working with a subset of your data, I can look at the counts directly and feed them into the prop.test() function. Here, we see 1 female in zone 1 and 3 females in zone 2.

prop.test(x = c(1, 3), n = c(8, 8), p = NULL, alternative = "two.sided", correct = TRUE)

    2-sample test for equality of proportions with continuity correction

data:  c(1, 3) out of c(8, 8)
X-squared = 0.33333, df = 1, p-value = 0.5637
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.7812791  0.2812791
sample estimates:
prop 1 prop 2 
 0.125  0.375

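R may warn that the Chi-squared approximation may be incorrect. The warning appears because the cell counts in this subset are very small, so the estimates here will be quite poor. Since this is only an illustration, I wouldn't worry about it; with larger samples it should be much less of a concern.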
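Rather than typing the counts by hand, they can be computed from the data, which helps when the real sample sizes differ by zone. The following is a minimal sketch that rebuilds only the Zone and Gen columns from the question; the object names (females, totals, res_prop, res_fisher) are my own, and suppressWarnings() is used here only to silence the small-sample warning on this subset. It also runs fisher.test() on the same 2 x 2 table as one way to carry out the exact test mentioned earlier.

```r
# Rebuild the Zone and Gen columns from the question's example data
df <- data.frame(
  Zone = rep(c(1, 2), each = 8),
  Gen = c("Male", "Male", "Other Gender Identity", "Male", "Male", "Male",
          "Male", "Female", "Female", "Male", "Female", "Male", "Male",
          "Male", "Male", "Female")
)

# Number of females and total respondents per zone
females <- as.numeric(tapply(df$Gen == "Female", df$Zone, sum))
totals  <- as.numeric(table(df$Zone))

# Two-sample test for equality of proportions (chi-squared based)
res_prop <- suppressWarnings(prop.test(x = females, n = totals))

# Fisher's exact test on the zone-by-female 2 x 2 table
tab <- table(Zone = df$Zone, Female = df$Gen == "Female")
res_fisher <- fisher.test(tab)

print(res_prop)
print(res_fisher)
```

With the full data, the same code would pick up the actual per-zone totals automatically, so nothing needs to be hard-coded.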

If, on the other hand, you’re interested in whether the population proportions of men and women are not equal, then you can perform this test individually within each respective zone.
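As a rough sketch of that within-zone comparison, one option is an exact binomial test of whether the share of females among the female and male respondents in a zone differs from 0.5. This is only one possible reading of the question: it uses the counts from the table above, it leaves out the one "Other Gender Identity" respondent in zone 1 (an assumption on my part), and it uses binom.test() rather than prop.test() because the counts are so small.

```r
# Counts taken from the zone-gender table (females, males)
zone1 <- c(female = 1, male = 6)
zone2 <- c(female = 3, male = 5)

# Exact binomial test of H0: P(female) = 0.5 within each zone
res_z1 <- binom.test(x = zone1[["female"]], n = sum(zone1), p = 0.5)
res_z2 <- binom.test(x = zone2[["female"]], n = sum(zone2), p = 0.5)

print(res_z1)
print(res_z2)
```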

Now, let's talk about individual income. You're supplying R with character values where numeric ones are required. To get something estimable with a standard t-test, we must make a sensible compromise. Say you want to estimate the mean difference in income between two independent groups. Opinions may differ, but using the midpoint of each interval is not uncommon. For example, the midpoint of $1,500 - $1,999 is roughly $1,750. You'd assign this value to every observation in that bracket. Although this is only an approximation, it lets you calculate a measure of central tendency such as the mean.
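Here is a minimal sketch of the midpoint approach, assuming the four brackets that appear in the example data. The lookup vector midpoints and the column Inc_mid are names I made up, and the value of 750 for the open-ended "< $1,500" bracket is an assumption (it treats the bracket as running from $0 to $1,499); a different choice there would change the result.

```r
# Rebuild the Zone and Month_Inc columns from the question's example data
df <- data.frame(
  Zone = rep(c(1, 2), each = 8),
  Month_Inc = c("< $1,500", "< $1,500", "< $1,500", "$1,500 - $1,999",
                "$1,500 - $1,999", "< $1,500", "< $1,500", "< $1,500",
                "$1,500 - $1,999", "$2,000 - $2,499", "$1,500 - $1,999",
                "< $1,500", "$2,500 - $2,999", "< $1,500", "< $1,500",
                "< $1,500")
)

# Approximate midpoint for each bracket (750 for "< $1,500" is an assumption)
midpoints <- c("< $1,500"        = 750,
               "$1,500 - $1,999" = 1750,
               "$2,000 - $2,499" = 2250,
               "$2,500 - $2,999" = 2750)

# Map each bracket label to its numeric midpoint
df$Inc_mid <- unname(midpoints[df$Month_Inc])

# Welch two-sample t-test on the midpoint-coded income
res_inc <- t.test(Inc_mid ~ Zone, data = df, var.equal = FALSE)
print(res_inc)
```

Because every person in a bracket gets the same value, the within-group variance is understated, so the result is best read as an approximation.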

Thomas Bilach
  • Thank you! This is really helpful. One specific clarification in your code `n = c(8, 8)`: does this indicate the number of observations for each zone? In my case that would be 171 and 544, respectively in the actual data. – Tathagato Oct 18 '22 at 06:02
  • Correct. Again, if you’re interested in testing for a difference in the population proportions between men and women, by zone, then make sure you adjust the sample sizes appropriately. – Thomas Bilach Oct 18 '22 at 14:31
  • Also, do you see why a t-test is inappropriate given the categorical grouping of income? We don’t have a single numeric value for each person. – Thomas Bilach Oct 18 '22 at 14:38