I have a Blaze data object in Python like this:

import blaze as bz

bdata = bz.Data([(1, 'Alice', 100.9, 100),
           (2, 'Bob', 200.6, 200),
           (3, 'Charlie', 300.45, 300),
           (5, 'Edith', 400, 400)],
          fields=['id', 'name', 'revenue', 'profit'])

I would like to calculate the mean of the numeric columns. I tried something like this:

print({col: bdata[col].mean() for col in ['revenue', 'profit']})

and I get

{'profit': 250.0, 'revenue': 250.4875}

But I would like to calculate this in a single shot, the way pandas does with data.mean().

Any thoughts or suggestions?

Kathirmani Sukumar

2 Answers


That Pandas aggregation is kind of magical, and I don't think you'll be able to skip the non-numerical columns without some kind of logic.
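As a rough illustration of what "some kind of logic" could look like, here is a minimal sketch in plain Python rather than Blaze. It works on the raw rows from the question, and the helper name `numeric_means` is made up for this example. Note that it also picks up `id`, since that column is numeric too.

```python
from numbers import Number

# Same rows and field names as in the question.
rows = [(1, 'Alice', 100.9, 100),
        (2, 'Bob', 200.6, 200),
        (3, 'Charlie', 300.45, 300),
        (5, 'Edith', 400, 400)]
fields = ['id', 'name', 'revenue', 'profit']


def numeric_means(rows, fields):
    """Return {field: mean} for every column whose values are all numbers.

    Hypothetical helper for illustration; not part of Blaze.
    """
    means = {}
    # zip(*rows) transposes the row tuples into column tuples.
    for name, column in zip(fields, zip(*rows)):
        # bool counts as a Number, so exclude it explicitly.
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in column):
            means[name] = sum(column) / float(len(column))
    return means


means = numeric_means(rows, fields)
```

Here `means` has entries for `id`, `revenue` and `profit`, while `name` is skipped because its values are strings.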

If you have the option to add a dummy column, you could use by to do an aggregation across the entire table.

That would look like this:

bdata = bz.Data([('fnord', 1, 'Alice', 100.9, 100),
           ('fnord', 2, 'Bob', 200.6, 200),
           ('fnord', 3, 'Charlie', 300.45, 300),
           ('fnord', 5, 'Edith', 400, 400)],
          fields=['dummy', 'id', 'name', 'revenue', 'profit'])
bz.by(bdata.dummy, avg_profit=bdata.profit.mean(), avg_revenue=bdata.revenue.mean())

   dummy  avg_profit  avg_revenue
0  fnord         250     250.4875

Though that's not particularly concise either, and it requires modifying your data.

You could use odo to get quick access to that concise Pandas syntax:

from odo import odo
import pandas as pd
odo(bdata, pd.DataFrame).mean()
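For reference, this is roughly what the pandas side of that call does. The sketch assumes odo hands back an ordinary DataFrame with the four columns from the question, so it builds that frame directly instead of going through Blaze. As far as I know, recent pandas versions (2.0 and later) no longer skip string columns silently, so `numeric_only=True` is passed explicitly here.

```python
import pandas as pd

# Stand-in for the frame odo would produce from the question's data.
df = pd.DataFrame(
    [(1, 'Alice', 100.9, 100),
     (2, 'Bob', 200.6, 200),
     (3, 'Charlie', 300.45, 300),
     (5, 'Edith', 400, 400)],
    columns=['id', 'name', 'revenue', 'profit'])

# Restrict the reduction to numeric columns; 'name' is left out.
means = df.mean(numeric_only=True)
```

The result is a Series indexed by `id`, `revenue` and `profit`. If only revenue and profit are wanted, `df[['revenue', 'profit']].mean()` narrows it down.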
pneumatics

I think you might have better luck using the summary reduction:

from blaze import *
import pandas as pd

resume = summary(bdata, avg_profit=bdata.profit.mean(), avg_revenue=bdata.revenue.mean())
# Pair each summary field with its computed value, then build a one-row DataFrame.
SummaryStats = pd.DataFrame(pd.Series(dict(zip(resume.fields, compute(resume))))).T

The last line can be reduced to compute(resume) if you don't care about the result being a pd.DataFrame.

tipanverella