1

I have a dataset that contains data about the diagnosis of a number of medical conditions in patients. The values in these dimensions/columns are NA (patient was not in the diagnosis process), 1 = yes (diagnosed with the condition), 2 = No (not diagnosed with the condition). I want to create a separate column that counts the number of columns having 1 for each patient ID. My data frame has 1000 separate dimensions and I need to specifically pick based on either the index of the column or column name. my dataset looks like this

ID    A   B  C  D

1    NA   1  2  1

2    1    1  2  1

3    NA   2  2  1

4    1    1  2  1

5    2    1  1  1

6    2    1  NA 1

I obviously have way more columns and rows than this and any solution will require to consider the indexing. The reason I need to do this is to know the number of medical condition a patient is suffering from. I was able to solve this problem in SPSS using the 'transform' option as SPSS is my application of choice for data analysis. I have been enjoying R and intrigued by the number of statistical learning algos available in it.

I have tried the apply function and the colSums but I am struggling with indexing as well as setting the count function = +1. I have looked into for loops as well with no luck into my specific problem.

o6aid
  • 13
  • 5
  • Does this help: https://stackoverflow.com/questions/20305851/r-how-to-count-occurrences-of-values-across-multiple-columns-of-a-data-frame-and – TJ87 Aug 28 '19 at 17:09

2 Answers2

1

mydata$count <- apply(X=mydata[,-1],1,FUN=function(x) length(which(x=='1')))

using the [,-1] ignores the ID variable

TJ87
  • 404
  • 1
  • 3
  • 13
  • works like a peach. I have one more question, how do I ignore variables say from 1 to 15 and 39 to 60. I tried something like this "X=mydata[-1:-15, -38:-50]..." but gave me an error saying "Error in `$<-.data.frame`(`*tmp*`, count, value = c(4L, 4L, 5L, 5L, 3L, : replacement has 7396 rows, data has 7403". How do I deal with this? I am new so forgive me for asking what might seem like a bad question. Thanks a lot for the help though! – o6aid Aug 28 '19 at 18:17
  • Instead of using negative indexing, you could select the column numbers you do want to include. – TJ87 Aug 28 '19 at 19:44
  • However, your negative indexing would look like this: `mydata$count <- apply(X=mydata[,-c(1:15,38:50)],1,FUN=function(x) length(which(x=='1')))` – TJ87 Aug 28 '19 at 20:00
0

Welcome to SO. From my perspective, what you need to do here is to convert your data format from "wide" to "long" (check out tidy data: https://vita.had.co.nz/papers/tidy-data.pdf).

library(tidyverse)

df <- read.table(header = TRUE, 
text = "
ID    A   B  C  D

1    NA   1  2  1

2    1    1  2  1

3    NA   2  2  1

4    1    1  2  1

5    2    1  1  1

6    2    1  NA 1")

df %>%
 gather(key = "patient", value = "type", A:D) %>%
 mutate(contains_1 = ifelse(type == 1, "yes", "no"))

   ID patient type contains_1
1   1       A   NA       <NA>
2   2       A    1        yes
3   3       A   NA       <NA>
4   4       A    1        yes
5   5       A    2         no
6   6       A    2         no
7   1       B    1        yes
8   2       B    1        yes
9   3       B    2         no
10  4       B    1        yes
11  5       B    1        yes
12  6       B    1        yes
13  1       C    2         no
14  2       C    2         no
15  3       C    2         no
16  4       C    2         no
17  5       C    1        yes
18  6       C   NA       <NA>
19  1       D    1        yes
20  2       D    1        yes
21  3       D    1        yes
22  4       D    1        yes
23  5       D    1        yes
24  6       D    1        yes
Ben G
  • 4,148
  • 2
  • 22
  • 42