9

I have a data frame that has a format like the following:

Month       Frequency
2007-08     2
2010-11     5
2011-01     43
2011-02     52
2011-03     31
2011-04     64
2011-05     73

I would like to create a histogram from this data, using X bins (X will probably be around 15, but the actual data has over 200 months), and using the data from the frequency column as the frequency for each bin of the histogram. How can I accomplish this?

I've tried two approaches so far, with the hist() and barplot() commands. The problem with hist() is that it does not seem to give me any way to specify that the frequency column should supply the counts for the histogram. The problem with barplot() is that I don't have any flexibility in choosing X bins, and if there are omitted months, then the resulting graph is not actually a true histogram because the x-axis isn't continuous.

The only idea I have right now is to go with the barplot() approach, fill in the missing months with a value of 0 for Frequency, and use space=0 to remove the spacing between the bars. The problem with that is that it's not particularly easy to choose an arbitrary number of bins.

Stephen Booher

4 Answers

4

Take a gander at ggplot2.

If your data is in a data.frame called df:

library(ggplot2)
ggplot(df, aes(x = Month, y = Frequency)) + geom_bar(stat = 'identity')

or if you want continuous time:

df$Month<-as.POSIXct(paste(df$Month, '01', sep='-'),format='%Y-%m-%d')
ggplot(df,aes(x=Month,y=Frequency))+geom_bar(stat='identity')
Justin
  • When I attempt the second example, I get `Error: Non-continuous variable supplied to scale_y_continuous.` when I run the ggplot command. Any ideas? – Stephen Booher Feb 03 '12 at 18:45
  • I'd have to know more about your data. I assume your `Frequency` is not numeric. If I recreate your data: `df<-data.frame(Month=c('2007-08','2010-11','2011-01','2011-02','2011-03','2011-04','2011-05'),Frequency=c(2,5,43,52,31,64,73))`, it plots just fine with those two commands. Check the `str` of your data and make sure you're supplying a continuous variable for y. – Justin Feb 03 '12 at 19:00
  • I got the second example to work by converting the Month column to a Date type. This is closer to what I want, but ideally, I would have fewer bins. It looks like this is giving a bar (effectively a bin) for every individual date. _Edit_: I just saw your comment above; let me update my question to address that. – Stephen Booher Feb 03 '12 at 19:01
  • Exactly. If you have a specific number of bins and dates to include, I think you would have to add them as zeros. Whatever the plotting method you choose will need to know the x-value of your "zero" bars. – Justin Feb 03 '12 at 19:08
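The zero-filling idea discussed in these comments could be sketched in base R roughly as follows. This is only an illustration built on the sample data from the question; the column name `Date` and the data frames `all_months` and `full` are names made up for this sketch.

```r
# Sample data from the question
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Turn "YYYY-MM" into a real Date (first day of the month)
df$Date <- as.Date(paste0(df$Month, "-01"))

# Every month between the first and last observed month
all_months <- data.frame(Date = seq(min(df$Date), max(df$Date), by = "month"))

# Left join on Date; months without data get NA, which is then set to 0
full <- merge(all_months, df[, c("Date", "Frequency")], all.x = TRUE)
full$Frequency[is.na(full$Frequency)] <- 0

# One bar per month on a continuous axis, with no gaps between bars
barplot(full$Frequency, space = 0, names.arg = format(full$Date, "%Y-%m"))
```

This still draws one bar per month, so it addresses the continuity of the x-axis but not the choice of an arbitrary number of bins.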
4

To get this kind of flexibility, you may have to replicate your data. Here is one way of doing it with rep:

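n <- 10
# Toy data: n distinct x values, each with a frequency f
dat <- data.frame(
    x = sort(sample(1:50, n)),
    f = sample(1:100, n))
dat

# Repeat each row f times so that hist() sees one row per observation
expdat <- dat[rep(seq_len(n), times = dat$f), "x", drop = FALSE]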

Now you have your data replicated in the data.frame expdat, allowing you to call hist with different numbers of bins:

par(mfcol=c(1, 2))
hist(expdat$x, breaks=50, col="blue", main="50 bins")
hist(expdat$x, breaks=5, col="blue", main="5 bins")
par(mfcol=c(1, 1))

[Figure: two side-by-side histograms of expdat$x, titled "50 bins" and "5 bins"]

Andrie
  • When I asked the question, I simplified it too much because I neglected to mention that my frequencies actually spanned between 1 and 50+ million rather than the simple example I gave. These frequencies were too high to use `rep` on the raw data on my machine (8 GB RAM). I converted these frequencies to a smaller scale (1 to 100,000) which gave me enough of a histogram (i.e., a probability distribution) for my purposes. I like your answer in general though, and so far it's the only solution that I have found that gives me a "real" histogram. Thanks! – Stephen Booher Feb 06 '12 at 21:01
  • If your frequencies are too high, you might simply scale the frequencies down like this: `expdat <- dat[rep(1:n, times=dat$f / 1000), "x", drop=FALSE]` – Marian Dec 12 '13 at 09:57
3

Yes, rep solutions will waste too much memory in most interesting/large cases. The HistogramTools CRAN package includes an efficient PreBinnedHistogram function, which creates a base R histogram object directly from pre-computed breaks and counts, which is the kind of data the original question provides.
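For readers who want to see the idea without installing a package, here is a rough base R sketch of the same approach. It is not HistogramTools's implementation: it sums the frequencies per bin with cut() and tapply(), then assembles an object of class "histogram" by hand. The data and the names `dat`, `n_bins`, and `h` are invented for the example.

```r
# Hypothetical pre-counted data: x positions with their frequencies
dat <- data.frame(
  x = c(3, 8, 12, 19, 27, 33, 41, 48),
  f = c(2, 5, 43, 52, 31, 64, 73, 10)
)

n_bins <- 5
breaks <- seq(0, 50, length.out = n_bins + 1)  # 0, 10, ..., 50

# Assign each x to a bin, then sum the frequencies within each bin
bin <- cut(dat$x, breaks = breaks, include.lowest = TRUE)
counts <- as.vector(tapply(dat$f, bin, sum))
counts[is.na(counts)] <- 0  # bins with no data come back as NA

# Build the list that plot() expects for class "histogram"
h <- structure(
  list(
    breaks   = breaks,
    counts   = counts,
    density  = counts / (sum(counts) * diff(breaks)),
    mids     = (head(breaks, -1) + tail(breaks, -1)) / 2,
    xname    = "x",
    equidist = TRUE
  ),
  class = "histogram"
)

plot(h, col = "blue", main = "5 bins from pre-counted data")
```

Memory use here depends on the number of rows and bins, not on the size of the frequencies, so counts in the millions are no problem. Changing `n_bins` is all it takes to rebin.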

MurrayStokely
  • Thanks, this was really useful. The only disappointment is that the function only takes the arguments `breaks`, `counts` and `xnames` so presumably any fiddling around with other settings has to be done later e.g. `plot(myhist, axes = FALSE)` rather than setting `axes = FALSE` when the histogram is initially constructed. If anyone wants to see what the output looks like I included a histogram produced this way in this answer: http://stats.stackexchange.com/a/122853/22228 – Silverfish Nov 05 '14 at 23:54
0

Another possibility is to scale down your frequency variable by some large factor so that rep doesn't have as much work to do. Then adjust the vertical axis scale of the histogram by that same factor.
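A small sketch of what that could look like, with made-up data and a scale factor of one million (the names `dat`, `scale`, and `h` are assumptions for this example). Instead of relabelling the axis, it multiplies the counts stored in the histogram object back up before plotting:

```r
# Hypothetical data with very large frequencies
dat <- data.frame(
  x = c(5, 15, 25, 35, 45),
  f = c(2e6, 5e6, 43e6, 52e6, 31e6)
)

scale <- 1e6

# Replicate each x only f / scale times instead of f times
expdat <- rep(dat$x, times = round(dat$f / scale))

# Compute the histogram without drawing it
h <- hist(expdat, breaks = seq(0, 50, by = 10), plot = FALSE)

# Restore the original magnitude on the vertical axis
h$counts <- h$counts * scale

plot(h, col = "blue", main = "Counts rescaled by 1e6", xlab = "x")
```

Rounding means that any frequency well below the scale factor drops to zero, so the result is an approximation of the true counts.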

Nick