4

I have a dataframe that looks like this

Country    <- rep(c("Austria", "Austria","Belgium", "Belgium", "Spain", "Slovenia", "France"), times=3)
Institute  <- rep(c("Inst 1","Inst 2","Inst 3","Inst 4","Inst 5","Inst 6","Inst 7"), times=3)
Ans        <- rep(c(1,2,3,1,NA,2,2),times=3)
Category.1 <- rep(c("Cat 1", "Cat 2", "Cat 2", "Cat 2","Cat 2", "Cat 1", "Cat 1"),times=3)
Category.2 <- rep(c("P", "L", "M", "P", "P", "L", "M"),times=3)
qs  <- c(rep("Q1.a-Some Text", times=7),rep("Q1.b-Some Text", times=7), rep("Q1.c-Some Text", times=7))    
df <- data.frame(Country=Country,Institute=Institute, Category.1=Category.1, Category.2=Category.2, qs=qs, Ans=Ans)
df<-df %>% spread(qs,Ans)
head(df)

 Country Institute Category.1 Category.2 Q1.a-Some Text Q1.b-Some Text Q1.c-Some Text
1  Austria    Inst 1      Cat 1          P              1              1              1
2  Austria    Inst 2      Cat 2          L              2              2              2
3  Belgium    Inst 3      Cat 2          M              3              3              3
4  Belgium    Inst 4      Cat 2          P              1              1              1
5   France    Inst 7      Cat 1          M              2              2              2
6 Slovenia    Inst 6      Cat 1          L              2              2              2

Short explanation of the dataframe: There is some question, say Q1, and for this question there are multiple "Sub-questions", say a,b,c, where for each of these "sub-question/options" respondents were asked to answer using some scale, in this example from 1 to 3. My scope is to calculate the relative frequencies for each subquestion, of each response. So, I use this function:

multichoice<-function(data, question.prefix){
  index<-grep(question.prefix, names(data))    # identifies the index for the available options in Q.12
  cases<-length(index)                # The number of possible options / columns 

  # Identify the range of possible answers for each question 
  # Step 1. Search for the min in each col and across each col choose the min
  # step 2. Search for the max in each col and across each col choose the max 

  mn<-min(data[,index[1:cases]], na.rm=T)
  mx<-max(data[,index[1:cases]], na.rm=T)
  d = colSums(data[, index] != 0, na.rm = TRUE)  # The number of elements across column vector, that are different from zero. 

  vec<-matrix(,nrow=length(mn:mx),ncol=cases)

  for(j in 1:cases){
    for(i in mn:mx){
      vec[i,j]=sum(data[, index[j]] == i, na.rm = TRUE)/d[j]  # This stores the relative responses for option j for the answer that is i
    }
  }

  vec1<-as.data.frame(vec)
  names(vec1)<-names(data[index])
  vec1<-t(vec1)
  return(vec1)
}

Calling, the function I get the desired dataframe.

q1  <- as.data.frame(multichoiceq4(df,"^Q1")) 
head(q1)

                     V1  V2        V3
Q1.a-Some Text 0.3333333 0.5 0.1666667
Q1.b-Some Text 0.3333333 0.5 0.1666667
Q1.c-Some Text 0.3333333 0.5 0.1666667

Which shows that for option "a", 33% of participants answered with 1, 50% with 2 etc ...

My QUESTION

I want to calculate, the same but conditional on the categories. So, I want to see how the relative frequencies will look like based on category1, category2. Can someone suggest me something on how I can do this ?

msh855
  • 1,493
  • 1
  • 15
  • 36

1 Answers1

3

I think you could make your code more flexible by keeping your data in a long format (that is, don't do df<-df %>% spread(qs,Ans)) and using dplyr, e.g.:

This part essentially reproduces functionality of your multichoice function:

df %>% 
    group_by(qs,Ans) %>% 
    summarize(total=n()) %>% 
    filter(!is.na(Ans)) %>% 
    mutate(frac=total/sum(total)) %>% 
    dcast(qs~Ans,value.var='frac')
#               qs         1   2         3
# 1 Q1.a-Some Text 0.3333333 0.5 0.1666667
# 2 Q1.b-Some Text 0.3333333 0.5 0.1666667
# 3 Q1.c-Some Text 0.3333333 0.5 0.1666667

And this one gives an example how it can be modified to take into account categories.

df %>% 
    group_by(qs,Category.1,Ans) %>% 
    summarize(total=n()) %>% 
    filter(!is.na(Ans)) %>% 
    mutate(frac=total/sum(total)) %>% 
    dcast(qs~Ans+Category.1,value.var='frac')
#               qs   1_Cat 1   1_Cat 2   2_Cat 1   2_Cat 2   3_Cat 2
# 1 Q1.a-Some Text 0.3333333 0.3333333 0.6666667 0.3333333 0.3333333
# 2 Q1.b-Some Text 0.3333333 0.3333333 0.6666667 0.3333333 0.3333333
# 3 Q1.c-Some Text 0.3333333 0.3333333 0.6666667 0.3333333 0.3333333
Marat Talipov
  • 13,064
  • 5
  • 34
  • 53
  • That is really cool, but I am a bit confused about how I can read the second table. For example, in first one, it is clear that rows must add up to 100% and the interpretation is immediate - 0.33% to the Q1.a responded with 1, 50% with 2etc. Regarding the second table, however, am I am bit lost. – msh855 May 26 '16 at 21:09
  • In the headers, the value before the underscore shows the question number, and the next value shows the category. I could help you with your table if you showed an example of the expected output – Marat Talipov May 26 '16 at 22:10
  • Thank you, basically I would like my table to look in a way, that either the rows ad up to 100% or the columns - depending on what is more convenient, if necessary to split the table according to the different categories then it is also fine. In the second table, for example, this can happen if one adds the first column of the responses with the third, as this shows how this category responded to options a, b and c and their total indeed add up to 100%. – msh855 May 27 '16 at 07:25
  • can you please help me the table from the second set of results - i.e when one takes into account the categories - to look in a way that the columns are arranged per the set of categories. So, I want first to have the 1_Cat 1, 2_ Cat 1, 1_ Cat 2, 2_Cat 2 etc – msh855 Jun 01 '16 at 11:59
  • try changing orders in `group_by`, i.e. `group_by(qs,Ans,Category.1)` – Marat Talipov Jun 01 '16 at 14:18