
I want to make a histogram of the counts of a small set of numbers. I used:

  1. 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8
  2. 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6
  3. 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6

I only get the correct plot for 2). For both 1) and 3), the counts for the categories 2 and 3 are incorrect.

> symm <- c(2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8) 

> hist(symm)

> posskew <- c(2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6)

> hist(posskew)

> negskew <- c(2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6)

> hist(negskew)

I expect to get a bar going up to a frequency of 1 for the category of 2 in the symmetric data. Instead, my bar goes up to 3 for the categories of 2 and 3 (something similar occurs in the negatively skewed data).

joran
zeigna
  • Sorry @MrFlick! I searched and did not spot the answer (or did not recognize that it was the answer) – zeigna Oct 18 '19 at 23:10

1 Answer


Histograms are intended for continuous data. The hist function expects that it will need to bin the data and offers quite a few options for algorithms to compute the bins, which you can see in ?hist.
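To see the binning directly, one option is to ask hist for its result without drawing anything and inspect the bin edges and counts. This is only a sketch using the symm vector from the question. With the defaults (right = TRUE, include.lowest = TRUE), the first bin is closed at both ends, so if the default breaks fall on the integers, the single 2 and the two 3s would land in the same first bar.

```r
symm <- c(2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8)

# plot = FALSE returns the histogram object without drawing it
h <- hist(symm, plot = FALSE)

h$breaks  # bin edges chosen by the default rule
h$counts  # number of observations falling in each bin

# Compare with the count of each distinct value
table(symm)
```

If the first entry of h$counts is larger than the count of 2s in table(symm), the first bin is holding more than one distinct value.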

For data like yours that is essentially discrete, a barplot is more appropriate. This will not bin values together.

par(mfrow = c(1, 3))
barplot(table(symm))
barplot(table(posskew))
barplot(table(negskew))


barplot itself can cause problems, because your data isn't actually categorical. The distance between 3 and 5 has meaning, and that meaning is the same whether or not there are any 4s. The code I show above will put 3 right next to 5 if there aren't any 4s.

problem = c(2, 3, 3, 5, 5, 5, 6, 6, 7)
barplot(table(problem))


A fix is to actually make it categorical, converting to factor and specifying the levels.


fix = factor(problem, levels = seq(min(problem), max(problem)))
barplot(table(fix))
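As a quick check of why the factor version helps, the table can be printed before plotting. The sketch below reuses the problem vector from above; the variable name counts is introduced here only for illustration. Because 4 is listed as a level, it appears in the table with a count of zero, and barplot reserves an empty slot for it.

```r
problem <- c(2, 3, 3, 5, 5, 5, 6, 6, 7)

# Every integer from min to max becomes a level, observed or not
fix <- factor(problem, levels = seq(min(problem), max(problem)))

counts <- table(fix)
counts
# Levels 2 through 7, with counts 1, 2, 0, 3, 2, 1
```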

The other fix is to force hist to use the breaks you want.

hist(problem, breaks = seq(min(problem) - 1, max(problem)))


This approach would work just as well for your examples. However, notice how much less useful the x-axis labels are in this case. hist is still binning the data, so each bar covers a range; we've just forced each bin to contain exactly one unique value. The left bar covers all data from 1 to 2 (a single 2 in the data), the next bar covers all data greater than 2 through 3 (two 3s in the data), etc. Just looking at the histogram, you might think the data includes a single 1 and two 2s, when in fact it contains no 1s and a single 2.
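One possible way around the labeling issue, not part of the approach above, is to put the breaks halfway between the integers so that each bar is centered on the value it counts. This sketch assumes the data are whole numbers; centered_breaks is a name made up for the example.

```r
problem <- c(2, 3, 3, 5, 5, 5, 6, 6, 7)

# Breaks at 1.5, 2.5, ..., 7.5 so that each bin straddles one integer
centered_breaks <- seq(min(problem) - 0.5, max(problem) + 0.5, by = 1)

# Inspect the bins without plotting
h <- hist(problem, breaks = centered_breaks, plot = FALSE)
h$mids    # bin midpoints: 2 3 4 5 6 7
h$counts  # counts per bin: 1 2 0 3 2 1

# Draw the histogram with bars centered on the integers
hist(problem, breaks = centered_breaks)
```

The bars are still bins in hist's sense, but since each one is centered on a single integer, the axis is harder to misread.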

Summary: If your data has evenly spaced values (with some missing), and you want one bar per unique value, barplot is great. If your data has more unique values than you want bars, then you need to bin, and hist is great.

Gregor Thomas