How do I create a histogram of an array of integers? For example:

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]

I want to create a histogram based on how many entries there are for 0, 1, 2, and so on. Is there an easy way to do it in Ruby?

The output should be two arrays. The first array should contain the groups (bins), the second array should contain the number of occurrences (frequencies).

For data given above, I would expect the following output:

bins         # => [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frequencies  # => [1, 1, 5, 6, 4, 2, 3, 5, 1, 2, 1]
wteuber
Whitecat
  • What is the output format you want? – sawa Sep 30 '13 at 18:29
  • When you ask a question, asking for code, you need to show your research and any attempts you made to solve the problem, along with your explanation why they didn't work. – the Tin Man Sep 30 '13 at 19:45
  • I would initialise a "counting hash" like that `h = Hash.new(0)` and count occurrences of each element: `data.each{|v| h[v] += 1}`. After that, `h` will look like this: `=> {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}` You can extract bins and freqs from that Hash using `h.keys` and `h.values`. I hope you find that useful. – wteuber Aug 21 '18 at 23:14
  • Ruby 2.7.0 introduces `Enumerable#tally`: `data.tally` => `{0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}` – dug Jun 07 '19 at 21:53

2 Answers

Ruby's Array gets group_by from the Enumerable mixin, which does this nicely:

Hash[*data.group_by{ |v| v }.flat_map{ |k, v| [k, v.size] }]

Which returns:

{
     0 => 1,
     1 => 1,
     2 => 5,
     3 => 6,
     4 => 4,
     5 => 2,
     6 => 3,
     7 => 5,
     8 => 1,
     9 => 2,
    10 => 1
}

That's a nice, clean hash. If you'd rather have an array of [bin, frequency] pairs, you can shorten it to:

data = [0,1,2,2,3,3,3,4]
data.group_by{ |v| v }.map{ |k, v| [k, v.size] }
# => [[0, 1], [1, 1], [2, 2], [3, 3], [4, 1]]

Here's what group_by and flat_map are doing, step by step, with the smaller dataset:

data.group_by{ |v| v }    
# => {0=>[0], 1=>[1], 2=>[2, 2], 3=>[3, 3, 3], 4=>[4]}

data.group_by{ |v| v }.flat_map{ |k, v| [k, v.size] }  
# => [0, 1, 1, 1, 2, 2, 3, 3, 4, 1]
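
The question asks for two separate arrays rather than a hash. A minimal sketch of one way to get there from the pairs: the names `bins` and `frequencies` are taken from the question, and sorting the pairs is an assumption so that the bins come out in ascending order even when the input is unsorted.

```ruby
data = [0,1,2,2,3,3,3,4]

# Build [value, count] pairs, sort them by value, then split the pairs
# into one array of values and one array of counts.
pairs = data.group_by { |v| v }.map { |k, v| [k, v.size] }.sort
bins, frequencies = pairs.transpose

bins         # => [0, 1, 2, 3, 4]
frequencies  # => [1, 1, 2, 3, 1]
```

Note that for an empty `data`, `transpose` returns `[]`, so both `bins` and `frequencies` end up `nil`.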

As mentioned by Telmo Costa in the comments, Ruby introduced tally in v2.7.0. A quick benchmark shows that tally is about 2x faster than the group_by variants:

require 'fruity'

puts "Ruby v#{RUBY_VERSION}"

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]

data.group_by{ |v| v }.map{ |k, v| [k, v.size] }.to_h
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.group_by { |v| v }.transform_values(&:size)
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.tally 
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.group_by{ |v| v }.keys.sort.map { |key| [key, data.group_by{ |v| v }[key].size] }.to_h
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}

compare do
  gb { data.group_by{ |v| v }.map{ |k, v| [k, v.size] }.to_h }
  rriemann { data.group_by { |v| v }.transform_values(&:size) }
  telmo_costa { data.tally }
  CBK {data.group_by{ |v| v }.keys.sort.map { |key| [key, data.group_by{ |v| v }[key].size] }.to_h }
end

Resulting in:

# >> Ruby v2.7.0
# >> Running each test 1024 times. Test will take about 2 seconds.
# >> telmo_costa is faster than rriemann by 2x ± 0.1
# >> rriemann is similar to gb
# >> gb is faster than CBK by 8x ± 1.0

So use tally.
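
One thing to watch for: tally only reports values that actually occur, so a value with no entries is missing from the hash instead of showing up with a count of 0. A minimal sketch of filling those gaps, assuming Ruby 2.7.0 or later, non-empty integer data, and that you want one bin per integer between the smallest and largest value (the sample array here is made up for illustration):

```ruby
data = [0, 1, 1, 4, 4, 4]   # illustrative data with no 2s or 3s

counts = data.tally          # => {0=>1, 1=>2, 4=>3}

# One bin per integer from the smallest to the largest value;
# values that never occur get a frequency of 0.
bins = (data.min..data.max).to_a
frequencies = bins.map { |b| counts.fetch(b, 0) }

bins         # => [0, 1, 2, 3, 4]
frequencies  # => [1, 2, 0, 0, 3]
```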

the Tin Man
  • The latest Ruby versions allow some syntactic sugar for a shorter version: `data.group_by { |v| v }.transform_values(&:size)` – rriemann Oct 01 '18 at 10:04
  • This should be the accepted answer! – sambecker Apr 06 '19 at 03:46
  • You can use `itself`: `data.group_by(&:itself).transform_values(&:size)`. Or, as has been said before, starting with Ruby 2.7.0, `data.tally`. – Telmo Costa Dec 16 '19 at 14:37
  • To sort the keys in ascending order: `data.group_by{ |v| v }.keys.sort.map do |key| [key, data.group_by{ |v| v }[key].size] end` – CBK Jan 30 '20 at 21:28
  • No, don't. It's at least 8x slower if you use the OP's sample array. See the benchmarks. – the Tin Man Feb 02 '20 at 23:07

Use the "histogram" gem.

require 'histogram/array'  # from the histogram gem; adds Array#histogram

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]
(bins, freqs) = data.histogram

This returns two arrays: bins, containing the histogram's bins, and freqs, containing the frequency for each bin. The gem also supports different binning behaviors and weights/fractions.
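
To illustrate what a different binning behavior can mean, here is a plain-Ruby sketch that does not use the gem at all. `fixed_width_histogram` is a hypothetical helper written for this example, not part of the gem's API; it assumes numeric data and labels each bin by its lower edge.

```ruby
data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]

# Hypothetical helper (not from the histogram gem): group values into
# bins of a fixed width and label each bin by its lower edge.
def fixed_width_histogram(values, width)
  counts = Hash.new(0)
  values.each { |v| counts[(v / width).floor * width] += 1 }
  # Sort by bin edge, then split into [edges, counts].
  counts.sort.transpose
end

bins, freqs = fixed_width_histogram(data, 5)

bins   # => [0, 5, 10]
freqs  # => [17, 13, 1]
```

With a width of 1 this reduces to the per-value counts the question asks for.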

Hope this helps.

the Tin Man
Rahul Jiresal