
I'm trying to work out the most efficient way to loop through some deeply nested data, compute the average of the values, and return a new array of hashes with the data grouped by date.

The raw data looks like this:

[
    client_id: 2,
    date: "2015-11-14",
    txbps: {
        "22"=>{
            "43"=>17870.153846153848,
            "44"=>15117.866666666667
        }
    },
    client_id: 1,
    date: "2015-11-14",
    txbps: {
        "22"=>{
            "43"=>38113.846153846156,
            "44"=>33032.0
        }
    },
    client_id: 4,
    date: "2015-11-14",
    txbps: {
        "22"=>{
            "43"=>299960.0,
            "44"=>334182.4
        }
    },
]

I have about 10,000,000 of these to loop through so I'm a little worried about performance.

The end result needs to look like this, where each avg is the average of the txbps values for that date:

[
    {
        date: "2015-11-14",
        avg: 178730.153846153848
    },
    {
        date: "2015-11-15",
        avg: 123987.192873978987
    },
    {
        date: "2015-11-16",
        avg: 126335.982123876283
    }
]

I've tried this to start:

results.map { |val| val["txbps"].values.map { |a| a.values.sum } }

But that's giving me this:

[[5211174.189281798, 25998.222222222223], [435932.442835184, 56051.555555555555], [5718452.806735582, 321299.55555555556]]

And I just can't figure out how to get it done. I can't find any good references online either.

I also tried to group by the date first:

res.map { |date, values| values.map { |client| client["txbps"].map { |tx,a| { date: date, client_id: client[':'], tx: (a.values.inject(:+) / a.size).to_i } } } }.flatten

[
    {
        :date=>"2015-11-14",
        :client_id=>"2",
        :tx=>306539
    },
    {
        :date=>"2015-11-14",
        :client_id=>"2",
        :tx=>25998
    },
    {
        :date=>"2015-11-14",
        :client_id=>"2",
        :tx=>25643
    },
    {
        :date=>"2015-11-14",
        :client_id=>"2",
        :tx=>56051
    },
    {
        :date=>"2015-11-14",
        :client_id=>"1",
        :tx=>336379
    },
    {
        :date=>"2015-11-14",
        :client_id=>"1",
        :tx=>321299
    }
]

If possible, how can I do this in a single pass?

---- EDIT ----

Got a little bit further:

res.map { |a,b|
  {
    date: a[:date], val: a["txbps"].values.map { |k,v|
      k.values.sum / k.size
    }.first
  }
}.
group_by { |el| el[:date] }.map { |date,list|
  {
    key: date, val: list.map { |elem| elem[:val] }.reduce(:+) / list.size
  }
}

But that's epic - is there a faster, simpler way??

Jenny Blunt
  • I'm sure readers can help if you clarify your question. Firstly, your raw data needs to be a Ruby object. It looks like it's an array of hashes, but if so you need to add an open brace before each `:client_id` key. (While you're at it, please assign it to a variable (e.g., `arr = [{ client_id: ,...]`), so we can refer to the variable in comments and answers without having to define it.) Also, please explain how you computed `avg: 178730.15...` for `date: "2015-11-14"`. It's an order of magnitude larger than all of the values of the input array, so I don't understand how it could be an average. – Cary Swoveland Nov 15 '15 at 03:32
  • I posted an answer based on my understanding of the question, which could well be wrong. Please look it over and tell me if that's what you are looking for. If it is, I will edit my answer to provide a detailed explanation of the steps. – Cary Swoveland Nov 15 '15 at 04:32

2 Answers


Try #inject

Like #map, it's a way of converting an enumerable (an array, a hash, pretty much anything you can loop over in Ruby) into a different object. It's a lot more flexible than #map, which is very helpful, but that flexibility also makes the method hard to wrap your head around. I think Drew Olson explains it best in his answer:

You can think of the first block argument as an accumulator: the result of each run of the block is stored in the accumulator and then passed to the next execution of the block. In the case of the code shown above, you are defaulting the accumulator, result, to 0. Each run of the block adds the given number to the current total and then stores the result back into the accumulator. The next block call has this new value, adds to it, stores it again, and repeats.

Examples:

To sum all the numbers in an array (with #inject), you can do this:

array = [5,10,7,8]
#            |- Initial Value   
array.inject(0) { |sum, n| sum + n } #=> 30
#                   |- You return the new value for the accumulator in this block.

To find the average of an array of numbers, compute the sum and then divide once by the size. If you instead divide each num inside the inject block ({ |sum, num| sum + (num / array.size) }), you perform one division per element rather than a single division at the end.

array = [5,10,7,8]
array.inject(0.0) { |sum, num| sum + num } / array.size #=> 7.5

Method

If creating methods on classes is your style, you can define a method on the Array class (from John Feminella's answer). Put this code somewhere before you need to find the sum or mean of an array:

class Array
  def sum
    inject(0.0) { |result, el| result + el }
  end

  def mean 
    sum / size
  end
end

And then

[5,10,7,8].sum  #=> 30.0
[5,10,7,8].mean #=> 7.5
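
If you would rather not reopen Array, a plain helper method does the same job. This is only a minimal sketch; the name mean_of is made up for illustration. It may matter on Ruby 2.4 and later, where Array#sum is built in, so the definition above would replace the core method for the whole program.

```ruby
# Hypothetical helper (mean_of is not from any library).
# Computes the arithmetic mean without modifying the Array class.
def mean_of(values)
  return nil if values.empty? # nothing to average, and avoids 0.0 / 0
  # Start from 0.0 so the final division is floating point.
  values.inject(0.0) { |sum, n| sum + n } / values.size
end

mean_of([5, 10, 7, 8]) #=> 7.5
```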

Gem

If you like putting code in black boxes, or really precious minerals, then you can use the average gem by fegoa89: gem install average. It also supports #mode and #median.

[5,10,7,8].mean #=> 7.5

Solution:

Assuming your objects look like this:

data = [
    {
        date: "2015-11-14",
        ...
        txbps: {...},
    },
    {
        date: "2015-11-14",
        ...
        txbps: {...},
    },
    ...
]

This code does what you need, but it's somewhat complex.

class Array
  def sum
    inject(0.0) { |result, el| result + el }
  end

  def mean 
    sum / size
  end
end

data = (data.inject({}) do |hash, item|
    this = (item[:txbps].values.map {|i| i.values}).flatten # Get values of values of `txbps`
    hash[item[:date]] = (hash[item[:date]] || []) + this # If a list already exists for this date, use it, otherwise create a new list, and add the info we created above.
    hash # Return the hash for future use
end).map do |day, value| 
    {date: day, avg: value.mean} # Clean data
end

This merges the txbps values into arrays grouped by date and then averages each array. For the data in the question, it returns:

[{:date=>"2015-11-14", :avg=>123046.04444444446}]
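
One thing that may cost time on a large data set is that `(hash[item[:date]] || []) + this` builds a brand-new array on every iteration. Below is a sketch of the same #inject idea that appends to each date's array in place instead. It assumes the same data shape (an array of hashes with :date and :txbps symbol keys); the method name daily_averages and the sample values are illustrative only.

```ruby
# Sketch only: daily_averages is a hypothetical name, and the sample data
# is made up to mirror the shape assumed above.
def daily_averages(data)
  grouped = data.inject(Hash.new { |h, k| h[k] = [] }) do |hash, item|
    # Append every nested txbps value to this date's array, in place.
    item[:txbps].each_value { |inner| hash[item[:date]].concat(inner.values) }
    hash # inject needs the accumulator returned from the block
  end
  grouped.map do |date, vals|
    { date: date, avg: vals.inject(0.0, :+) / vals.size }
  end
end

sample = [
  { date: "2015-11-14", txbps: { "22" => { "43" => 10.0, "44" => 20.0 } } },
  { date: "2015-11-14", txbps: { "22" => { "43" => 30.0 } } },
  { date: "2015-11-15", txbps: { "22" => { "43" => 4.0, "44" => 6.0 } } }
]
daily_averages(sample)
#=> [{:date=>"2015-11-14", :avg=>20.0}, {:date=>"2015-11-15", :avg=>5.0}]
```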
Ben Aubin
  • Thanks for your answer and explanation. Do you think you could put this in context with my question? I don't really know what I should replace. – Jenny Blunt Nov 15 '15 at 02:14
  • @JennyBlunt, sure. I could. – Ben Aubin Nov 15 '15 at 02:18
  • @JennyBlunt, I think your raw data at the beginning is not formatted correctly. Are they objects? – Ben Aubin Nov 15 '15 at 02:23
  • It's data from Mongo(id); I cleaned it up a little. Generally Client.all.to_a kinda stuff – Jenny Blunt Nov 15 '15 at 02:30
  • Thanks :) And thanks for the explanation too. I ran the three methods in a loop on a live data set. Yours is by far the clearest but loses on speed (81s for 100 loops) :( My monstrosity and the other answer come in at around 32s. – Jenny Blunt Nov 15 '15 at 11:12

Data Structure

I assume your input data is an array of hashes. For example:

arr = [
  {
    client_id: 2,
    date: "2015-11-14",
    txbps: {
      "22"=>{
        "43"=>17870.15,
        "44"=>15117.86
      }
    }
  },
  {
    client_id: 1,
    date: "2015-11-15",
    txbps: {
      "22"=>{
        "43"=>38113.84,
        "44"=>33032.03,
      }
    }
  },

  {
    client_id: 4,
    date: "2015-11-14",
    txbps: {
      "22"=>{
        "43"=>299960.0,
        "44"=>334182.4
      }
    }
  },
  {
    client_id: 3,
    date: "2015-11-15",
    txbps: {
      "22"=>{
        "43"=>17870.15,
        "44"=>15117.86
      }
    }
  }
]

Code

Based on my understanding of the problem, you can compute averages as follows:

def averages(arr)
  h = arr.each_with_object(Hash.new { |h,k| h[k] = [] }) { |g,h|
    g[:txbps].values.each { |f| h[g[:date]].concat(f.values) } }
  h.merge(h) { |_,v| (v.reduce(:+)/(v.size.to_f)).round(2) }
end

Example

For arr above:

avgs = averages(arr)
  #=> {"2015-11-14"=>166782.6, "2015-11-15"=>26033.47} 

The value of the hash h in the first line of the method was:

{"2015-11-14"=>[17870.15, 15117.86, 299960.0, 334182.4],
 "2015-11-15"=>[38113.84, 33032.03, 17870.15, 15117.86]} 

Convert hash returned by averages to desired array of hashes

avgs is not in the form of the output desired. It's a simple matter to do the conversion, but you might consider leaving the hash output in this format. The conversion is simply:

avgs.map { |d,avg| { date: d, avg: avg } }
 #=> [{:date=>"2015-11-14", :avg=>166782.6},
 #    {:date=>"2015-11-15", :avg=>26033.47}]
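
With around 10,000,000 records, holding every value in per-date arrays may use a lot of memory. A possible variation, sketched under the assumption that the input has the shape of arr above, keeps only a running sum and count for each date and builds the array of hashes directly. The name streaming_averages is hypothetical, and this version does not round the result.

```ruby
# Sketch: streaming_averages is an illustrative name, not part of the
# method above. It makes one pass over the records.
def streaming_averages(arr)
  # totals maps each date to a two-element array: [running_sum, count]
  totals = Hash.new { |h, k| h[k] = [0.0, 0] }
  arr.each do |rec|
    pair = totals[rec[:date]]
    rec[:txbps].each_value do |inner|
      inner.each_value do |v|
        pair[0] += v
        pair[1] += 1
      end
    end
  end
  # Assumes every date has at least one value; otherwise avg would be NaN.
  totals.map { |date, (sum, count)| { date: date, avg: sum / count } }
end

sample = [
  { date: "2015-11-14", txbps: { "22" => { "43" => 1.0, "44" => 3.0 } } },
  { date: "2015-11-15", txbps: { "22" => { "43" => 10.0 } } },
  { date: "2015-11-14", txbps: { "22" => { "43" => 5.0 } } }
]
streaming_averages(sample)
#=> [{:date=>"2015-11-14", :avg=>3.0}, {:date=>"2015-11-15", :avg=>10.0}]
```

To get the { key: ..., val: ... } form instead, only the key names in the final map need to change.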

Explanation

Rather than explain in detail how the method works, I will instead give an alternative form of the method that does exactly the same thing, but in a more verbose and slightly less Ruby-like way. I've also included the conversion of the hash to an array of hashes at the end:

def averages(arr)
  h = {}
  arr.each do |g|
    vals = g[:txbps].values      
    vals.each do |f|
      date = g[:date]
      h[date] = [] unless h.key?(date)
      h[date].concat(f.values)
    end
  end

  keys = h.keys
  keys.each do |k|
    val = h[k]
    h[k] = (val.reduce(:+)/(val.size.to_f)).round(2)
  end

  h.map { |d,avg| { date: d, avg: avg } }
end

Now let me insert some puts statements to print out various intermediate values in the calculations, to help explain what's going on:

def averages(arr)
  h = {}
  arr.each do |g|
    puts "g=#{g}"
    vals = g[:txbps].values      
    puts "vals=#{vals}"
    vals.each do |f|
      puts "  f=#{f}"
      date = g[:date]
      puts "  date=#{date}"
      h[date] = [] unless h.key?(date)
      puts "  before concat, h=#{h}"
      h[date].concat(f.values)
      puts "  after concat, h=#{h}"
    end
    puts
  end

  puts "h=#{h}"
  keys = h.keys
  puts "keys=#{keys}"

  keys.each do |k|
    val = h[k]
    puts "  k=#{k}, val=#{val}"
    puts "  val.reduce(:+)=#{val.reduce(:+)}"
    puts "  val.size.to_f=#{val.size.to_f}"
    h[k] = (val.reduce(:+)/(val.size.to_f)).round(2)
    puts "  h[#{k}]=#{h[k]}"
    puts
  end

  h.map { |d,avg| { date: d, avg: avg } }
end

Execute averages once more:

averages(arr)

g={:client_id=>2, :date=>"2015-11-14", :txbps=>{"22"=>{"43"=>17870.15, "44"=>15117.86}}}
vals=[{"43"=>17870.15, "44"=>15117.86}]
  f={"43"=>17870.15, "44"=>15117.86}
  date=2015-11-14
  before concat, h={"2015-11-14"=>[]}
  after concat, h={"2015-11-14"=>[17870.15, 15117.86]}

g={:client_id=>1, :date=>"2015-11-15", :txbps=>{"22"=>{"43"=>38113.84, "44"=>33032.03}}}
vals=[{"43"=>38113.84, "44"=>33032.03}]
  f={"43"=>38113.84, "44"=>33032.03}
  date=2015-11-15
  before concat, h={"2015-11-14"=>[17870.15, 15117.86], "2015-11-15"=>[]}
  after concat, h={"2015-11-14"=>[17870.15, 15117.86], "2015-11-15"=>[38113.84, 33032.03]}

g={:client_id=>4, :date=>"2015-11-14", :txbps=>{"22"=>{"43"=>299960.0, "44"=>334182.4}}}
vals=[{"43"=>299960.0, "44"=>334182.4}]
  f={"43"=>299960.0, "44"=>334182.4}
  date=2015-11-14
  before concat, h={"2015-11-14"=>[17870.15, 15117.86],
                    "2015-11-15"=>[38113.84, 33032.03]}
  after concat, h={"2015-11-14"=>[17870.15, 15117.86, 299960.0, 334182.4],
                   "2015-11-15"=>[38113.84, 33032.03]}

g={:client_id=>3, :date=>"2015-11-15", :txbps=>{"22"=>{"43"=>17870.15, "44"=>15117.86}}}
vals=[{"43"=>17870.15, "44"=>15117.86}]
  f={"43"=>17870.15, "44"=>15117.86}
  date=2015-11-15
  before concat, h={"2015-11-14"=>[17870.15, 15117.86, 299960.0, 334182.4],
                    "2015-11-15"=>[38113.84, 33032.03]}
  after concat, h={"2015-11-14"=>[17870.15, 15117.86, 299960.0, 334182.4],
                   "2015-11-15"=>[38113.84, 33032.03, 17870.15, 15117.86]}

h={"2015-11-14"=>[17870.15, 15117.86, 299960.0, 334182.4],
   "2015-11-15"=>[38113.84, 33032.03, 17870.15, 15117.86]}
keys=["2015-11-14", "2015-11-15"]
  k=2015-11-14, val=[17870.15, 15117.86, 299960.0, 334182.4]
  val.reduce(:+)=667130.41
  val.size.to_f=4.0
  h[2015-11-14]=166782.6

  k=2015-11-15, val=[38113.84, 33032.03, 17870.15, 15117.86]
  val.reduce(:+)=104133.87999999999
  val.size.to_f=4.0
  h[2015-11-15]=26033.47

  #=> [{:date=>"2015-11-14", :avg=>166782.6},
  #    {:date=>"2015-11-15", :avg=>26033.47}]
Cary Swoveland
  • I can't really claim to understand everything that's going on in there. It's on a par speed wise with mine but much tidier. – Jenny Blunt Nov 15 '15 at 11:13
  • The only challenge I have with it is that it outputs the date as the key. Ideally I need it like { key: "2015...", val: 11111 } etc. I can run through another map - not sure if it's possible on the fly – Jenny Blunt Nov 15 '15 at 11:25
  • Jenny, the last line of code in my initial answer (`avgs.map...`) showed how to obtain the array of hashes from the hash `avgs` that I have computed. I have made that more explicit in my revised answer. You could of course change the keys to `:key` and `:val` if you wish. – Cary Swoveland Nov 16 '15 at 07:38