How to visualize change in binary/categorical data over time?

Question

>dput(data)
structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 
3, 3), Dx = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1), Month = c(0, 
6, 12, 18, 24, 0, 6, 12, 18, 24, 0, 6, 12, 18, 24), score = c(0, 
0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)), .Names = c("ID", 
"Dx", "Month", "score"), row.names = c(NA, -15L), class = "data.frame")

>data
    ID Dx Month score
1   1  1     0     0
2   1  1     6     0
3   1  1    12     0
4   1  1    18     1
5   1  1    24     1
6   2  1     0     1
7   2  1     6     1
8   2  2    12     1
9   2  2    18     0
10  2  2    24     1
11  3  1     0     0
12  3  1     6     0
13  3  1    12     0
14  3  1    18     0
15  3  1    24     0

Suppose I have the above data.frame. I have 3 patients (ID = 1, 2 or 3). Dx is the diagnosis (Dx = 1 is normal, Dx = 2 is diseased). There is a Month variable and, last but not least, a test score variable. The participants' test score is binary, and it can change from 0 to 1 or revert from 1 to 0. I am having trouble coming up with a way to visualize this data. I would like an informative graph that looks at:

The trend of the participants' test scores over time.
How that trend compares to the participants' diagnosis over time

In my real dataset I have over 800 participants, so I do not want to construct 800 separate graphs ... I think the test score variable being binary really has me stumped. Any help would be appreciated.

Having 800 trends in one graph would be messy, can't you aggregate them or something? — Soheil, May 04 '15 at 08:41
Patient score over time can be tracked in a Shewhart chart, see package qcc. You can choose from EWMA, CUSUM or a Shewhart that is particular to your situation, e.g. a C chart [month count] or a U chart [monthly rates]. — Henk, May 04 '15 at 08:52

score 4 · Answer 1 · edited May 23 '17 at 12:24

With ggplot2 you can make faceted plots with subplots for each patient (see my solution for dealing with the large number of plots below). An example visualization:

library(ggplot2)
ggplot(data, aes(x=Month, y=score, color=factor(Dx))) +
  geom_point(size=5) +
  scale_x_continuous(breaks=c(0,6,12,18,24)) +
  scale_color_discrete("Diagnosis",labels=c("normal","diseased")) +
  facet_grid(.~ID) +
  theme_bw()

which gives:

[Figure: one panel per patient ID, score plotted against Month, points colored by diagnosis]

Including 800 patients in one plot might be a bit too much as already mentioned in the comments of the question. There are several solutions to this problem:

Aggregate the data.
Create patient subgroups and make a plot for each subgroup.
Filter out all the patients who have never been ill.

With regard to the last suggestion, you can do that with the following code (which I adapted from an answer to one of my own questions):

deleteable <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
data2 <- data[deleteable==0,]

You can use the same approach to create a new variable that flags patients who have never been ill (1 = never ill, 0 = ill at some point):

data$neverill <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))

You can then, for example, aggregate the data by several grouping variables (e.g. Month and neverill).
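As a rough sketch of that aggregation step, the following uses base R's aggregate() on the sample data from the question. The choice of mean(score) as the summary (the proportion of positive scores) and the name agg are my assumptions, not part of the original answer:

```r
# Sample data from the question, rebuilt so the sketch runs on its own
data <- data.frame(
  ID    = rep(1:3, each = 5),
  Dx    = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1),
  Month = rep(c(0, 6, 12, 18, 24), times = 3),
  score = c(0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
)

# 1 = every visit of this patient had Dx == 1 (never ill), 0 otherwise
data$neverill <- with(data, ave(Dx, ID, FUN = function(x) all(x == 1)))

# Proportion of positive scores per month within each group
# (the mean of a 0/1 variable is the share of 1s; this summary is an assumption)
agg <- aggregate(score ~ Month + neverill, data = data, FUN = mean)
agg
```

The result has one row per Month/neverill combination, so it stays the same size no matter how many patients there are, and it could be plotted as two lines (one per group) instead of 800 panels.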

Chris · Accepted Answer · 2015-05-04T20:00:47.043

Note: Most of the data manipulation below is needed for part 2 (comparing the scores with the diagnosis). Part 1 (the score trend) is less complex, and you can see where it fits in below.

Uses

library(data.table)
library(ggplot2)
library(reshape2)

To Compare

First, recode Dx from 1/2 to 0/1 so that it is on the same scale as score (assuming that a score of 0 corresponds to Dx = 1, i.e. normal):

data$Dx <- data$Dx - 1

Now, create a lookup matrix (rows indexed by Dx, columns by score) that returns 1 for a diagnosis of 1 with a test score of 0, and -1 for a test score of 1 with a diagnosis of 0.

compare <- matrix(c(0,1,-1,0),ncol = 2,dimnames = list(c(0,1),c(0,1)))
> compare
  0  1
0 0 -1
1 1  0

Now, let's score every event. This simply looks up the matrix above for every row in your data frame:

data$calc <- diag(compare[as.character(data$Dx),as.character(data$score)])

Note: diag() first builds a full n-by-n matrix, so for large data sets this can be sped up by using matching instead, but it is a quick fix for smaller sets like yours.
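One possible way to do the faster lookup, sketched here under the assumption that Dx has already been recoded to 0/1: indexing a matrix with a two-column character matrix picks out one cell per (row name, column name) pair, so no large intermediate matrix is built. The names calc_fast and calc_slow are only for illustration:

```r
# Lookup matrix from above: rows are Dx, columns are score
compare <- matrix(c(0, 1, -1, 0), ncol = 2,
                  dimnames = list(c(0, 1), c(0, 1)))

# Sample values from the question, with Dx already recoded to 0/1
Dx    <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0)
score <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Each row of the cbind() result is one (Dx, score) pair to look up
calc_fast <- compare[cbind(as.character(Dx), as.character(score))]

# The diag() version for comparison; it builds a 15 x 15 matrix first
calc_slow <- diag(compare[as.character(Dx), as.character(score)])

all(calc_fast == calc_slow)
```

For this particular matrix the lookup is equivalent to Dx - score, which is simpler still; the matrix form is mainly useful if you want a different coding for the two kinds of mismatch.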

To allow us to use data.table aggregation:

data <- data.table(data)

Now we need to create our variables:

tograph <- melt(data[, list(ScoreTrend = sum(score)/.N, 
                            Type = sum(calc)/length(calc[calc != 0]), 
                            Measure = sum(abs(calc))), 
                     by = Month],
                id.vars = c("Month"))

ScoreTrend: The proportion of positive scores in each month. This shows the trend of scores over time.
Type: The balance of -1 vs. 1 events in each month. If this returns -1, all incorrect events were score = 1, Dx = 0. If it returns 1, all incorrect events were Dx = 1, score = 0. A zero means the two are balanced.
Measure: The raw number of incorrect events (events where score and Dx disagree).
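To make the three definitions concrete, here is a small base R sketch that computes them by hand for Months 12 and 18 of the sample data (with Dx already recoded to 0/1). The helper summarise_month is hypothetical and only for illustration, and it uses Dx - score as a shortcut that gives the same values as the compare lookup:

```r
# Sample values for two months, patients 1, 2, 3, with Dx recoded to 0/1
month12 <- data.frame(Dx = c(0, 1, 0), score = c(0, 1, 0))
month18 <- data.frame(Dx = c(0, 1, 0), score = c(1, 0, 0))

# Hypothetical helper: the same three summaries as in the data.table call
summarise_month <- function(d) {
  calc <- d$Dx - d$score              # 1, -1 or 0, as in the compare matrix
  c(ScoreTrend = mean(d$score),       # proportion of positive scores
    Type       = sum(calc) / sum(calc != 0),
    Measure    = sum(abs(calc)))      # number of mismatches
}

s12 <- summarise_month(month12)
s18 <- summarise_month(month18)
rbind(month12 = s12, month18 = s18)
```

In Month 18 there is one event of each kind, so Type is 0 and Measure is 2. In Month 12 score and Dx agree for every patient, so Type is 0/0, which R reports as NaN.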

We melt this table by Month so that we can create a faceted graph.

If there are no incorrect events, we will get a NaN for Type. To set this to 0:

# is.nan() is needed here: comparing with NaN via == yields NA, never TRUE
tograph[is.nan(value), value := 0]

Finally, we can plot

ggplot(tograph, aes(x = Month, y = value)) + geom_line() + facet_wrap(~variable, ncol = 1)

We can now see, in one plot:

The proportion of positive scores by month
The proportion of under vs. over diagnosis
The number of incorrect diagnoses.
