I have a data frame in R. The first two columns are my summed frequencies of "Yes" and "No." The final 3 columns are categorical factors, each with a label.
I am trying to make a 4-D contingency table from this format and I have no idea where to start the process.
My data looks like this:
Sold Unsold Label1 Label2 Label3
1 3330 32102 AdvancedShopper: Y TERR_USED: Non-TREE SPINOFF: N
2 2735 30691 HSEHLD_INCDT_BAND: 0 CLM_FREE_INCDT_CT: 0 SPINOFF: N
3 3350 29485 TERR_USED: Non-TREE CLM_FREE_INCDT_CT: 0 SPINOFF: N
4 3864 28657 SingleMulti: N TERR_USED: Non-TREE SPINOFF: N
5 2691 26355 TERR_USED: Non-TREE HSEHLD_INCDT_BAND: 0 CLM_FREE_INCDT_CT: 0
6 2396 25884 TERR_USED: Non-TREE HSEHLD_INCDT_BAND: 0 SPINOFF: N
7 2738 25172 Channel: Owned Agency TERR_USED: Non-TREE SPINOFF: N
8 3203 24425 TERR_USED: Non-TREE FULL_CVG_FLG: Y SPINOFF: N
9 2781 24163 SingleMulti: N CLM_FREE_INCDT_CT: 0 SPINOFF: N
10 1950 22371 AdvancedShopper: Y CLM_FREE_INCDT_CT: 0 SPINOFF: N
11 2644 21528 TERR_USED: Non-TREE FULL_CVG_FLG: N SPINOFF: N
12 2278 21736 Channel: Owned Agency SingleMulti: N SPINOFF: N
13 2324 21648 SingleMulti: N HSEHLD_INCDT_BAND: 0 CLM_FREE_INCDT_CT: 0
14 3108 20780 Channel: Prudent TERR_USED: Non-TREE SPINOFF: N
15 2491 21216 TERR_USED: Non-TREE PRIOR_BI: High SPINOFF: N
I began with 8 columns: 3 category names, the 3 corresponding values, the number of quotes written, and the number of sales on those quotes. I concatenated each category with its value to form the three label columns above. I have 19 categories, each with between 2 and 6 attributes. Sorting puts the respective columns in order, but does not necessarily form the 4-D boxes for each combination of 3 categories and the respective Yes (Sold) and No (Unsold).
The mean rate of sales is 11.4%, and I would like to get the frequencies into shape to run chi-squared tests on these four-way contingency tables to identify the combinations that deviate most strongly from the mean.
I have 80046 combinations: essentially (19 choose 3) category triples, each expanded over the attribute values of its three categories. For example, Row 1 is from a 4-D table of 16 cells (2 attr x 2 attr x 2 attr x [Y,N]), Row 2 is from a 4-D table of 96 cells (4 attr x 6 attr x 2 attr x [Y,N]), etc.
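To make the "outliers from the mean" idea concrete, here is a minimal sketch of a per-row check. It uses the first three rows shown above and compares each row's Sold/Unsold split to the overall 11.4% rate with a goodness-of-fit test. This is a simpler test than the four-way table I am after, and the names counts, mean_rate, rate and chisq are just illustrative.

```r
# First three rows of the data shown above (counts copied from the table).
counts <- data.frame(
  Sold   = c(3330, 2735, 3350),
  Unsold = c(32102, 30691, 29485)
)
mean_rate <- 0.114  # overall sale rate quoted above

# One goodness-of-fit test per row: observed (Sold, Unsold) vs. the overall rate.
stats <- apply(counts, 1, function(r) {
  unname(chisq.test(r, p = c(mean_rate, 1 - mean_rate))$statistic)
})

counts$rate  <- counts$Sold / (counts$Sold + counts$Unsold)
counts$chisq <- stats

# Rows ordered by how far they sit from the overall rate.
counts[order(-counts$chisq), ]
```

With 80046 rows, the resulting statistics would presumably need some multiple-comparison adjustment before any of them is read as significant.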
I'm unsure how to get this data into a format to start using the table() and xtabs() functions, and from there chisq.test(). (Should I go back to the step before I concatenated the categories and values?)
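If going back is the right move, the concatenation looks reversible. A minimal sketch, assuming every label has the form "Category: Value" with ": " appearing only once as the separator (the vector labels below just holds three labels copied from the table above):

```r
# Three labels copied from the data above.
labels <- c("AdvancedShopper: Y", "TERR_USED: Non-TREE", "Channel: Owned Agency")

# Everything before the first ": " is the category, everything after is the value.
# Assumption: ": " occurs exactly once per label.
category <- sub(": .*$", "", labels)
value    <- sub("^[^:]*: ", "", labels)

data.frame(category, value)
```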
I am new to R, but I understand it is supposed to be much better suited to programming for these large arrays. I don't have access to SPSS, but I do have access to SAS (which I am also new to) if there's something easier to try there.
Any sort of direction is a big help.
------------------- Desired output? reply ---------------------
Well, the table() command takes a data.frame from
Category 1 Category 2 Category 3 Y/N
...into contingency table format, right? But I already have my Yes's and No's in a frequency format with the three categories listed as such.
Do I need to change to this single-instance format and explode my 80046-row table into millions of rows? Or is there a way to call table() with the frequencies of Yes and No already tabulated in two columns?
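From the documentation, xtabs() appears to accept a matrix of counts on the left-hand side of its formula, which would avoid exploding the rows. A minimal sketch for a single category triple, assuming the categories are back in separate columns; the attribute levels and all counts below are invented for illustration and are not my real data:

```r
# Hypothetical data for one category triple (levels and counts are made up).
d <- expand.grid(
  AdvancedShopper = c("N", "Y"),
  TERR_USED       = c("Non-TREE", "TREE"),
  SPINOFF         = c("N", "Y")
)
d$Sold   <- c(120, 95, 80, 60, 40, 35, 25, 20)
d$Unsold <- c(900, 810, 700, 640, 380, 300, 260, 210)

# The two count columns become the levels of a fourth dimension.
tab4 <- xtabs(cbind(Sold, Unsold) ~ AdvancedShopper + TERR_USED + SPINOFF,
              data = d)
dim(tab4)      # one extent per factor, plus one for Sold/Unsold
ftable(tab4)   # flattened display of the 4-D table

# summary() on a table reports a chi-squared test of independence of all factors.
summary(tab4)

# Alternative: treat each attribute combination as a row and test whether
# the Sold/Unsold split differs across combinations.
cells <- as.matrix(d[, c("Sold", "Unsold")])
chisq.test(cells)
```

Repeating this per triple would mean first splitting the full data frame by which three categories each row belongs to.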