How can I superimpose an arbitrary parametric distribution over a histogram using ggplot?

Question

I have made an attempt based on a Quick-R example, but I don't understand where the scaling factor comes from. Is this method reasonable? How can I modify it to use ggplot?

An example overplot the normal and lognormal distributions using this method follows:

## Get a log-normalish data set: the number of characters per word in "Alice in Wonderland"
alice.raw <- readLines(con = "http://www.gutenberg.org/cache/epub/11/pg11.txt", 
                       n = -1L, ok = TRUE, warn = TRUE,
                       encoding = "UTF-8")

alice.long <- paste(alice.raw, collapse=" ")
alice.long.noboilerplate <- strsplit(alice.long, split="\\*\\*\\*")[[1]][3]
alice.words <- strsplit(alice.long.noboilerplate, "[[:space:]]+")[[1]]
alice.nchar <- nchar(alice.words)
alice.nchar <- alice.nchar[alice.nchar > 0]

# Now we want to plot both the histogram and then log-normal probability dist
require(MASS)
h <- hist(alice.nchar, breaks=1:50, xlab="Characters in word", main="Count")
xfit <- seq(1, 50, 0.1)

# Plot a normal curve
yfit<-dnorm(xfit,mean=mean(alice.nchar),sd=sd(alice.nchar))
yfit <- yfit * diff(h$mids[1:2]) * length(alice.nchar) 
lines(xfit, yfit, col="blue", lwd=2) 

# Now plot a log-normal curve
params <- fitdistr(alice.nchar, densfun="lognormal")
yfit <- dlnorm(xfit, meanlog=params$estimate[1], sdlog=params$estimate[1])
yfit <- yfit * diff(h$mids[1:2]) * length(alice.nchar)
lines(xfit, yfit, col="red", lwd=2)

This produces the following plot: Plot produced by the code above, showing a histogram of word length superimposed with a normal distribution curve and a log-normal distribution curve

To clarify, I would like to have counts on the y-axis, rather than a density estimate.

note that a normal distribution does not make sense as words all have > 0 letters, and the values are discrete integers; normal is continuous. — David LeBauer, Jun 27 '12 at 14:12
Agreed - this is a toy example with a handy dataset. And a normal curve is probably inappropriate. — fmark, Jun 27 '12 at 23:29

score 12 · Accepted Answer · edited Jul 04 '12 at 00:14

Have a look at stat_function()

alice.raw <- readLines(con = "http://www.gutenberg.org/cache/epub/11/pg11.txt", 
                       n = -1L, ok = TRUE, warn = TRUE,
                       encoding = "UTF-8")

alice.long <- paste(alice.raw, collapse=" ")
alice.long.noboilerplate <- strsplit(alice.long, split="\\*\\*\\*")[[1]][3]
alice.words <- strsplit(alice.long.noboilerplate, "[[:space:]]+")[[1]]
alice.nchar <- nchar(alice.words)
alice.nchar <- alice.nchar[alice.nchar > 0]
dataset <- data.frame(alice.nchar = alice.nchar)
library(ggplot2)
ggplot(dataset, aes(x = alice.nchar)) + geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
    args = c(
      mean = mean(dataset$alice.nchar), 
      sd = sd(dataset$alice.nchar)), 
    colour = "red")

enter image description here

If you want to have counts on the y-axis as in the example, then you'll need a function that converts the density to counts:

dnorm.count <- function(x, mean = 0, sd = 1, log = FALSE, n = 1, binwidth = 1){
  n * binwidth * dnorm(x = x, mean = mean, sd = sd, log = log) 
}

ggplot(dataset, aes(x = alice.nchar)) + geom_histogram(binwidth=1.6) + 
  stat_function(fun = dnorm.count, 
                args = c(
                  mean = mean(dataset$alice.nchar), 
                  sd = sd(dataset$alice.nchar), 
                  n = nrow(dataset), binwidth=1.6), 
                colour = "red")

enter image description here

Nice. I think stat_function must be new. It is a great improvement over my previous approach, which was to create a data frame of x, dnorm(x, , ) first. — David LeBauer, Jun 27 '12 at 14:10
@David `stat_function` has been there for as long as I can remember! :) — joran, Jun 27 '12 at 14:42
This is great - is it possible to have counts on the y-axis though rather than density as in the example above? — fmark, Jul 02 '12 at 00:11
@fmark: you can. You'll need a function that converts the density to counts. — Thierry, Jul 02 '12 at 08:12
@Thierry Thanks for that. I've editted your answer a little because you need to include binwidth in the conversion function, and added the resulting plot. — fmark, Jul 04 '12 at 00:14

How can I superimpose an arbitrary parametric distribution over a histogram using ggplot?

1 Answers1