
DataFu has a nice implementation of HyperLogLog for estimating cardinality.

However, it's implemented as an Accumulator, which means it runs only in the reducer and not in the combiner (although, unlike a plain EvalFunc, it never loads the entire set into memory). Why couldn't DataFu implement it as Algebraic, filling the registers in every combiner and then merging the partial results in the reducer? Am I missing something here?

ihadanny
I'm voting to close this question as off-topic because questions about 'why' software works the way it does are not on topic here. The question might be on-topic for programming-SE, but I am not sure. – Dennis Jaheruddin Jun 06 '16 at 10:40

1 Answer


Fixed in 1.3.0: the UDF now implements Algebraic. See https://issues.apache.org/jira/browse/DATAFU-91

For details on how this cut a task's runtime from 10 minutes to 2 minutes, see: https://docs.google.com/spreadsheets/d/1oVYSCh22kufgQ49pgsuboKOMxDgz8N5yBtRpxuo69Lk/edit#gid=0
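To illustrate why an Algebraic version is possible at all, here is a minimal sketch in Python (not the DataFu code). HyperLogLog registers combine by element-wise maximum, so each combiner can fill its own registers and the reducer only has to merge them. The function names `initial`, `merge` and `final` are hypothetical and only loosely mirror the stages of Pig's Algebraic interface; the register count `P = 10` and the SHA-1 based hash are assumptions made for the example.

```python
import hashlib
import math

P = 10          # index bits; assumed value for illustration only
M = 1 << P      # number of registers (1024)

def _hash64(value):
    """64-bit hash derived from SHA-1 (an arbitrary choice for this sketch)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def initial(values):
    """Combiner-side step: fill the registers from one partition of the input."""
    registers = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                        # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)           # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1    # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    return registers

def merge(a, b):
    """Intermediate step: combine two partial results by element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def final(registers):
    """Reducer-side step: turn the registers into a cardinality estimate."""
    alpha = 0.7213 / (1 + 1.079 / M)               # bias constant, valid for M >= 128
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros > 0:
        return M * math.log(M / zeros)             # small-range (linear counting) correction
    return raw

# Two "combiners" each see half of the data; the "reducer" merges their registers.
part_a = initial(range(0, 10000))
part_b = initial(range(10000, 20000))
merged = merge(part_a, part_b)
whole = initial(range(0, 20000))
```

Here `merged` is identical to `whole`, which is the property that lets the work move into the combiner: only the fixed-size register arrays travel to the reducer, not the raw values.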

ihadanny