2

I have the results of a simulation that involved removing data and refitting a model, and generated the mean and CIs for 5 beta coefficients (AAA:EEE). The sample data are reproducible through dupt().

data <- structure(list(PercentData = structure(c(1L, 1L, 1L, 1L, 1L,  2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("90Percent",  "80Percent", "70Percent", "60Percent", "50Percent", "40Percent",  "30Percent", "20Percent"), class = "factor"), Beta = c("AAA",  "BBB", "CCC", "DDD", "EEE", "AAA", "BBB", "CCC", "DDD", "EEE",  "AAA", "BBB", "CCC", "DDD", "EEE"), Mean = c(-0.0184798128725727,  0.577389832570274, 0.307079889066798, -1.04434737355186, 0.765444299971639,  -0.0342811658086197, 0.571119844203796, 0.307904693724208, -1.05833526491829,  0.772586633692223, -0.0287982339992084, 0.567559187110271, 0.300408471488675,  -1.05392763762688, 0.768956684863523), UpperCI = c(0.011382484714714,  0.592146704143253, 0.334772268551607, -0.997865978815953, 0.787196643647358,  0.0270716705899447, 0.595047291677895, 0.363220155550484, -1.01101175408862,  0.82142109640807, 0.0501543137571774, 0.597455743424951, 0.351903162023205,  -1.00408187639287, 0.805740012899328), LowerCI = c(-0.0483421104598594,  0.562632960997295, 0.279387509581988, -1.09082876828776, 0.743691956295919,  -0.0956340022071842, 0.547192396729696, 0.252589231897933, -1.10565877574796,  0.723752170976376, -0.107750781755594, 0.537662630795591, 0.248913780954145,  -1.10377339886088, 0.732173356827717)), .Names = c("PercentData",  "Beta", "Mean", "UpperCI", "LowerCI"), row.names = c("X1", "X2",  "X3", "X4", "X5", "X1.1", "X2.1", "X3.1", "X4.1", "X5.1", "X1.2",  "X2.2", "X3.2", "X4.2", "X5.2"), class = "data.frame")

head(data)
#      PercentData Beta        Mean     UpperCI     LowerCI
# X1     90Percent  AAA -0.01847981  0.01138248 -0.04834211
# X2     90Percent  BBB  0.57738983  0.59214670  0.56263296
# X3     90Percent  CCC  0.30707989  0.33477227  0.27938751
# X4     90Percent  DDD -1.04434737 -0.99786598 -1.09082877
# X5     90Percent  EEE  0.76544430  0.78719664  0.74369196
# X1.1   80Percent  AAA -0.03428117  0.02707167 -0.09563400

I can plot the simulation data using this code

require(ggplot2)

ggplot(data, aes(x = Beta)) +
  geom_point(aes(y = Mean, color = PercentData),
             position = position_dodge(0.5), 
             size=2.5) +
    geom_errorbar(aes(ymin = LowerCI, 
                      ymax = UpperCI, 
                      color = PercentData), 
                  cex = 1.25, 
                  width = .75, 
                  position = position_dodge(0.5))

enter image description here

I want to add the "truth" to the above figure. Currently, I have the truth data in a different DF, which is below.

truth <- structure(list(Est = c(-0.0178489366139546, 0.575347417798796,  0.299445933484525, -1.02862600141036, 0.767365594695577), UpperCI = c(0.486793276079609,  0.647987076085212, 0.380433141441644, -0.937511307956846, 0.837682594951183 ), LowerCI = c(-0.522491149307518, 0.502707759512379, 0.218458725527406,  -1.11974069486387, 0.697048594439971), Beta = c("AAA", "BBB",  "CCC", "DDD", "EEE")), .Names = c("Est", "UpperCI", "LowerCI",  "Beta"), row.names = c(NA, 5L), class = "data.frame") 

head(truth)
#           Est    UpperCI    LowerCI Beta
# 1 -0.01784894  0.4867933 -0.5224911  AAA
# 2  0.57534742  0.6479871  0.5027078  BBB
# 3  0.29944593  0.3804331  0.2184587  CCC
# 4 -1.02862600 -0.9375113 -1.1197407  DDD
# 5  0.76736559  0.8376826  0.6970486  EEE

I would like to add the truth data as a line to the above figure and have provided a schematic below where the added black lines are the truth$Est values - although they are not drawn to represent the actual values.

If possible, it would be nice to also include the truth Upper and Lower CIs. Is it possible to draw two lines - one at each CI value?

enter image description here

I have left the truth data as a separate DF as I am not sure on the best way to format the data for the intended result. I can reformat based on comments or suggestions to have the data in a single melt() data frame.

Thanks in advance.

dayne
  • 7,504
  • 6
  • 38
  • 56
B. Davis
  • 3,391
  • 5
  • 42
  • 78

1 Answers1

3

With a little bit of data restructuring, this becomes simple with the use of geom_segment:

all.data <- merge(data, truth, by = "Beta")
all.data$xposition <- as.numeric(factor(all.data$Beta))

ggplot(all.data, aes(x = Beta)) +
  geom_point(aes(y = Mean, color = PercentData),
             position = position_dodge(0.5), 
             size=2.5) +
  geom_errorbar(aes(ymin = LowerCI.x, 
                    ymax = UpperCI.x, 
                    color = PercentData), 
                cex = 1.25, 
                width = .75, 
                position = position_dodge(0.5)) +
  geom_segment(aes(y = UpperCI.y, 
                   yend = UpperCI.y, 
                   x = xposition - 0.5, 
                   xend = xposition + 0.5)) +
  geom_segment(aes(y = LowerCI.y, 
                   yend = LowerCI.y, 
                   x = xposition - 0.5, 
                   xend = xposition + 0.5))

enter image description here

A few things to note:

  • The easiest way to add additional data with an additional geom to your plot is to include it as a separate column in your dataframe. This is no different than including the confidence interval columns for drawing errorbars
  • To determine the horizontal position of the segments, you can use the numeric values of the factor of your categorical x variable. As explained by Hadley, categorical variables still have numeric position on a plot.
  • You can change the width of your bars by changing the value added and subtracted to x and xend (currently 0.5)
Joe
  • 3,831
  • 4
  • 28
  • 44
  • very helpful. Thanks! Cool trick with the addition of xposition. – B. Davis May 08 '15 at 15:05
  • @B.Davis glad to help- also just realized I made a rookie mistake in the code above. Whatever you do *don't* name your new dataframe `all` unless you want to overwrite the `all()` function... – Joe May 08 '15 at 15:10
  • 2
    You may also add "x.position" to "truth" instead and use the `data` argument in `geom_segment`. Thus, you may use data from two (or more) data frames when you add layers, and there is no need for `merge`. It's a quite useful strategy, e.g. when you use e.g. raw data from one data frame and some form of aggregated (or whatever kind of) data from another data frame. – Henrik May 08 '15 at 15:19
  • @Henrik true. I generally find it easier to stay mentally organized if each plot has one dataframe, but that's entirely a stylistic choice. – Joe May 08 '15 at 15:22