0

With the following data, I am trying to calculate the Chi Square and Bonferroni lower and upper Confidence intervals. The column "Data_No" identifies the dataset (as calculations needs to be done separately for each dataset).

Data_No    Area    Observed
   1        3353    31
   1        2297    2
   1        1590    15
   1        1087    16
   1        817     2
   1        847     10
   1        1014    28
   1        872     29
   1        1026    29
   1        1215    21
   2        3353    31
   2        2297    2
   2        1590    15
   3        1087    16
   3        817     2

The code I used is

        library(dplyr) 
        setwd("F:/GIS/July 2019/") 
        total_data <- read.csv("test.csv") 
        result_data <- NULL 
        for(i in unique(total_data$Data_No)){ 
        data <- total_data[which(total_data$Data_No == i),] data <- data %>%
        mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU = Observed/sum(Observed), Alpha = 0.05/2*count(Data_No), 
Zvalue = qnorm(Alpha,lower.tail=FALSE), lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)), upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
result_data <- rbind(result_data,data) }
write.csv(result_data,file='final_result.csv')

And the error message I get is:

Error in UseMethod("summarise_") : no applicable method for 'summarise_' applied to an object of class "c('integer', 'numeric')"

The column that I am calling "Alpha" is the alpha value of 0.05/2k, where K is the number of categories - in my example, I have 10 categories ("Data_No" column) for the first dataset, so "Alpha" needs to be 0.05/20 = 0.0025, and it's corresponding Z value is 2.807. The second dataset has 3 categories (so 0.05/6) and the third has 2 categories (0.05/4) in my example table (Data_No" column). Using the values from the newly calculated "Alpha" column, I then need to calculate the ZValue column (Zvalue = qnorm(Alpha,lower.tail=FALSE)) which I then use to calculate the lower and upper confidence intervals.

From the above data, here are the results that I should get, but note that I have had to manually calculate Alpha column and Zvalue, rather than insert those calculations within the R code:

Data_No Area    Observed    RelativeArea    Alpha   Z value lower   upper
    1   3353    31          0.237           0.003   2.807   0.092   0.247
    1   2297    2           0.163           0.003   2.807   -0.011  0.033
    1   1590    15          0.113           0.003   2.807   0.025   0.139
    1   1087    16          0.077           0.003   2.807   0.029   0.146
    1   817     2           0.058           0.003   2.807   -0.011  0.033
    1   847     10          0.060           0.003   2.807   0.007   0.102
    1   1014    28          0.072           0.003   2.807   0.078   0.228
    1   872     29          0.062           0.003   2.807   0.083   0.234
    1   1026    29          0.073           0.003   2.807   0.083   0.234
    1   1215    21          0.086           0.003   2.807   0.049   0.181
    2   3353    31          0.463           0.008   2.394   0.481   0.811
    2   2297    2           0.317           0.008   2.394   -0.027  0.111
    2   1590    15          0.220           0.008   2.394   0.152   0.473
    3   1087    16          0.571           0.013   2.241   0.723   1.055
    3   817     2           0.429           0.013   2.241   -0.055  0.277

Please note that I only included some of the columns generated from the code.

JayPeerachai
  • 3,499
  • 3
  • 14
  • 29
EleMan
  • 43
  • 1
  • 7
  • Hi Eleman, I think there needs to be a additional column 'NO_OF_CATEGORIES' for each Data_No. Instead of passing Data_No, we can then pass this 'NO_OF_CATEGORIES' and get the correct value in Alpha. – ealbsho93 Jul 23 '19 at 17:36

2 Answers2

0
# You need to check the closing bracket for lower c.f. sqrt value. Following code should work.

data <- read.csv("test.csv") 
data <- data %>% mutate(RelativeArea =
                          Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE =
                          Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU =
                          Observed/sum(Observed), lower =
                          APU-2.394*sqrt(APU*(1-APU)/sum(Observed)), upper =
                                           APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))



#Answer to follow-up question.
#Sample Data
Data_No   Area   Observed
1         3353    31
1         2297    2
2         1590    15
2         1087    16

#Code to run
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
data <- total_data[which(total_data$Data_No == i),]
data <- data %>% mutate(RelativeArea =
                          Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE =
                          Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU =
                          Observed/sum(Observed), lower =
                          APU-2.394*sqrt(APU*(1-APU)/sum(Observed)), upper =
                                           APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))

result_data <- rbind(result_data,data)
}

write.csv(result_data,file='final_result.csv')
ealbsho93
  • 141
  • 4
  • Thank you, that worked nicely. Sorry one more question - I have 60 such tables. Instead of saving each one as a .csv and reading them into R individually, if I were to copy and paste all those values for each of the 60 tables into the same tab in an excel spreadsheet, would R be able to perform the same calculation on all tables separately? Thanks. – EleMan Jul 17 '19 at 13:10
  • Hi Gaius, Thank you for acknowledging the answer.It would be great if you could approve it as the answer. For your followup question: You will need one more variable in test.csv like data_no which identifies which data-set you are working on. Using that variable you can run a for loop on all your 60 data-sets. – ealbsho93 Jul 17 '19 at 18:44
  • brilliant - that worked perfectly - thank you so much. I have approved the answer by clicking on the green tick! I am trying to understand what the <- NULL and for(i in unique ((total_data$Data_No)){ data <- total_data[which(total_data$Data_No == i),]) does and how it works. – EleMan Jul 18 '19 at 07:24
  • I realize that with 60 tables, 2.394 is not constant and hence I don't get the correct result. I've tried to include a code to calculate the z Value which is Alpha/2*K where K is the number of categories. So in your example, I'm trying to count the number of categories using the column "Data_No" (both of which have two categories 1,1 and 2,2 in this example) so the Alpha value is 0.05/2*2 = 0.0125 and the corresponding Z value = 2.242 which I need to use for the last parts of the calculation for these two tables. Below is the code I used. – EleMan Jul 18 '19 at 15:43
  • hey eleman can you add this code in your original question? i cant make sense of it without the formatting. I think this shouldnt be a big issue once i know whats happening. – ealbsho93 Jul 18 '19 at 17:55
  • Hi @ealbsho93 I have edited my original question and code to update my question - hope it makes sense - thanks. – EleMan Jul 19 '19 at 16:43
  • I have attached a separate answer for them. Thanks. – ealbsho93 Jul 21 '19 at 15:49
0
       #Issue in calculating Alpha. I have updated the code.    
       library(dplyr) 
       setwd("F:/GIS/July 2019/") 
       total_data <- read.csv("test.csv") 
       #Creating the NO_OF_CATEGORIES column based on your question.
       total_data$NO_OF_CATEGORIES <- 0
       total_data[which(total_data$Data_No==1),]$NO_OF_CATEGORIES <- 10
       total_data[which(total_data$Data_No==2),]$NO_OF_CATEGORIES <- 3
       total_data[which(total_data$Data_No==3),]$NO_OF_CATEGORIES <- 2


       #Actual code


     result_data <- NULL 

for(i in unique(total_data$Data_No)){ 
  data <- total_data[which(total_data$Data_No == i),] 
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU = Observed/sum(Observed), Alpha = 0.05/(2*(unique(data$NO_OF_CATEGORIES))), 
           Zvalue = qnorm(Alpha,lower.tail=FALSE), lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)), upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data,data) }
write.csv(result_data,file='final_result.csv')
ealbsho93
  • 141
  • 4
  • Thanks.The code for column "Alpha = 0.05/2*(length(unique(total_data$Data_No)))" still does not know that the number within Data_No (which also identifies a 'dataset') needs to be counted. That count refers to the number of categories. The next column is 0.05/2*count of categories (by dataset), under Data_No. Right now, Alpha column generates the same number and not does not count each dataset separately, which is not right. So I guess the code needs to be modified to say count how many rows have the same number in Data_No and use that to multiple by 2 (or 0.05/2*count number of categories)? – EleMan Jul 22 '19 at 10:25
  • Hi Eleman, I have edited code based on what I understood what your comment was. If it works the way you want it does please acknowledge it. Thanks :) – ealbsho93 Jul 22 '19 at 17:54
  • Thanks - unfortunately it does not work, so can you please give me your email id (or please may I request you to send me an email at gaiuswilsonin@gmail.com) and I will email you the data and try and explain again what needs to be done and what the outcome should look like. At the moment, the Alpha column has only one value and in turn generates the same Z value and so it's wrong. Or should I just edit the original question, table, outcome and expected outcome?Thanks @ealbsho93 – EleMan Jul 23 '19 at 05:28
  • Please edit the original question , table , out come and expected outcome. If you want to keep this question for reference, you can create a new question too. – ealbsho93 Jul 23 '19 at 05:40
  • I've just done that, hopefully it makes sense now, as I've also shown what the expected outcome should be? – EleMan Jul 23 '19 at 06:55
  • Hi Eleman, I have added an additional column 'NO_OF_CATEGORIES' for each Data_No which I feel is necessary for calculating Alpha. Instead of passing Data_No, we pass this 'NO_OF_CATEGORIES' and get the correct value in Alpha. Also I have added an extra bracket so Alpha value gets calculated correctly by following BODMAS rule. If you already have NO_OF_CATEGORIES as a column in any other data_frame with Data_No, please lookup merging data frames. – ealbsho93 Jul 23 '19 at 17:49
  • I think that's a good idea, however, with over 60 tables it looks really time consuming and I would hope that there is an easier way. I have re-posted my question here: https://stackoverflow.com/questions/57159082/calculating-bonferroni-lower-and-upper-confidence-intervals-for-several-datasets . Hopefully this makes it clearer? Thanks @ ealbsho93 – EleMan Jul 24 '19 at 06:57
  • Alternatively, can I create a new column called "categories" which counts the number of rows that makes a dataset and then use that value from the category column to do the next part of the calculation to get the Z value? Thanks – EleMan Jul 24 '19 at 07:01
  • Hi EleMan, as long as you have the data in given format and run the code, you should be able to generate the output. Counting the number of rows should work. – ealbsho93 Jul 24 '19 at 20:10