
I would like to create a dichotomous variable that tells me whether a participant gave the same response to each of 10 questions. Each row is a participant and I want to write a simple script to create this new variable/vector in my data frame. For example, if my data looks like the first 6 columns, then I'm trying to create the 7th one.

ID   Item1  Item2  Item3  Item4  Item5  | AllSame
1    5      5      5      5      5      | Yes
2    1      3      3      3      2      | No
3    2      2      2      2      2      | Yes
4    5      4      5      5      5      | No
5    5      2      3      5      5      | No

I've seen solutions on this site that compare one column to another, for example with ifelse(data$item1==data$item2,1,ifelse(data$item1==data$item3,0,NA)), but I have 10 columns in my actual dataset and I figure there's got to be a better way than checking all 10 against each other. I could also create a variable that counts how many responses equal 1 and then test whether the count is the same as the number of columns, but with 7 possible responses in the data this once again looks very unwieldy, and I'm hoping someone has a better solution. Thank you!

Bofstein
  • Perhaps better on stackoverflow as this is programming rather than statistics – Henry Jun 23 '16 at 00:36
  • How do you want it to behave if there are all "NA" values in one row? – Glen_b Jun 23 '16 at 00:50
  • Possible duplicate of [Test for equality among all elements of a single vector](http://stackoverflow.com/questions/4752275/test-for-equality-among-all-elements-of-a-single-vector) – Glen_b Jun 23 '16 at 00:51
  • You should make your table into a minimal reproducible example. – Glen_b Jun 23 '16 at 00:58
  • Yes, it does seem to make more sense there, thank you. And Glen_b, there are some solutions there, thanks for the reference. As for a row of all NAs, I'd want that to end up as NA. I used the solution below by Henry; will that do that? It seems to, judging from my output, since I have about as many NAs as the number of rows I would expect to be totally blank. – Bofstein Jun 23 '16 at 01:23

2 Answers


There are many ways of doing this, but here is one:

mydf <- data.frame(Item1 = c(5,1,2,5,5), 
                   Item2 = c(5,3,2,4,2), 
                   Item3 = c(5,3,2,5,3), 
                   Item4 = c(5,3,2,5,5),
                   Item5 = c(5,3,2,5,5) )

mydf$AllSame <- rowMeans(mydf[,1:5] == mydf[,1]) == 1

which leads to

> mydf
  Item1 Item2 Item3 Item4 Item5 AllSame
1     5     5     5     5     5    TRUE
2     1     3     3     3     3   FALSE
3     2     2     2     2     2    TRUE
4     5     4     5     5     5   FALSE
5     5     2     3     5     5   FALSE

And if you really must have "Yes" and "No" then use instead something like

mydf$AllSame <- ifelse(rowMeans(mydf[,1:5] == mydf[,1]) == 1, "Yes", "No")
Henry
  • This worked and is just what I was trying to get at, thank you! I do not at all need the Yes and No, though since you posted that script I changed it to 1 and 0, since that will make analysis easier. Note for anyone finding this later, to avoid the problem I had at first: the number at `== mydf[,1]` should be the first column you are looking at; I kept it as 1 since I didn't know what it was doing, and all my responses were 0 at first. E.g. my final code was `data$SL_set1<- ifelse(rowMeans(data[,28:37] == data[,28]) == 1, 1, 0)` – Bofstein Jun 23 '16 at 01:21
  • I'm trying to figure out how this formula works so that I understand it well enough to modify it, e.g. for string variables (which I don't think this can do) or for missing data. Is it checking whether the mean of the 5 columns is the same value as the first column? Could that be a problem if the average happens to be the same as the first? I'm guessing not, because that would lead to a lot of false positives, but I don't understand how the formula works. I also only want it to be TRUE if there are no NAs in the row, so I'm thinking I need to add na.omit somewhere. – Bofstein Jun 23 '16 at 01:30
  • It checks, row by row, whether the values in the specified columns are all equal to the corresponding value in the reference column (the comparisons give TRUE or FALSE, which count as 1 and 0, so the average over a row is 1 iff they are all TRUE). It works with string variables so long as they are *not* factors. If there is an NA in a row, it gives NA instead of TRUE or FALSE. – Henry Jun 23 '16 at 07:49
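
To make the mechanism described in these comments concrete, here is a small sketch that splits the one-liner into its two steps. The names `cols`, `eq`, and `all_same`, and the sixth row containing an `NA`, are assumptions added for illustration; they are not part of the original answer.

```r
# Example data from the answer, plus one illustrative row containing an NA
mydf <- data.frame(Item1 = c(5, 1, 2, 5, 5, 4),
                   Item2 = c(5, 3, 2, 4, 2, 4),
                   Item3 = c(5, 3, 2, 5, 3, NA),
                   Item4 = c(5, 3, 2, 5, 5, 4),
                   Item5 = c(5, 3, 2, 5, 5, 4))

# Columns to compare; the reference column is always the first of these,
# which avoids the mismatch described in the comment above (28:37 vs. column 1).
cols <- 1:5

# Step 1: compare every column with the reference column, element by element.
# The result is a logical matrix with one row per participant.
eq <- mydf[, cols] == mydf[, cols[1]]

# Step 2: TRUE counts as 1 and FALSE as 0, so a row mean of exactly 1
# means that every comparison in that row was TRUE.
all_same <- rowMeans(eq) == 1

# The row with a missing value comes out as NA,
# because rowMeans() propagates NA by default.
print(all_same)
```

If rows with missing values should count as FALSE rather than NA, one option is `!is.na(all_same) & all_same`.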

Henry has posted a short and fast working solution that has already been accepted. I still wanted to add this alternative, which in my opinion has a slight advantage in readability:

mydf <- data.frame(Item1 = c(5,1,2,5,5), 
                   Item2 = c(5,3,2,4,2), 
                   Item3 = c(5,3,2,5,3), 
                   Item4 = c(5,3,2,5,5),
                   Item5 = c(5,3,2,5,5) )

mydf$AllSame <- apply(mydf, 1, function(row) all(row==row[1]))

The all() function used here has an na.rm argument, which can easily be set to TRUE if you want NAs to be ignored.
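
As a hedged illustration of how `na.rm` behaves in this comparison, the sketch below uses two made-up rows (`row_a` and `row_b` are assumptions, not data from the question). It also shows a caveat: when the first element itself is `NA`, every comparison is `NA`, so `na.rm = TRUE` leaves nothing to test. The helper `same_ignoring_na` is a hypothetical alternative, not part of the answer.

```r
# Two illustrative rows, each containing one NA
row_a <- c(4, 4, NA, 4, 4)   # NA in the middle, other values identical
row_b <- c(NA, 4, 3, 4, 4)   # NA in the reference position, other values differ

# Without na.rm the unknown comparison makes the result NA.
res_default <- all(row_a == row_a[1])

# With na.rm = TRUE the NA comparison is dropped before testing.
res_narm <- all(row_a == row_a[1], na.rm = TRUE)

# Caveat: comparing against an NA reference gives only NAs, and all() of an
# empty set is TRUE, so this reports TRUE even though 4 and 3 differ.
res_na_first <- all(row_b == row_b[1], na.rm = TRUE)

# Hypothetical alternative: count the distinct non-missing values instead.
same_ignoring_na <- function(row) length(unique(row[!is.na(row)])) == 1

print(c(res_default, res_narm, res_na_first,
        same_ignoring_na(row_a), same_ignoring_na(row_b)))
```

With `apply()`, the helper would be used as `apply(mydf, 1, same_ignoring_na)`. Note that it returns FALSE rather than NA for a row that is entirely missing.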

Bernhard