
What's the difference between BOINC https://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing

vs. general big-data frameworks such as Hadoop and Spark? They all seem to be distributed-computing frameworks -- are there places where I can read about the differences, or about BOINC in particular?

It seems the Large Hadron Collider in Europe is using BOINC; why not Hadoop?

Thanks.

howard
  • CERN does in fact utilize Hadoop + Spark. Example: http://openlab.web.cern.ch/technical-area/data-analytics – Justin Kestelyn Jun 27 '16 at 22:22
  • The **sharpest difference** between BOINC and Hadoop/Spark/other distributed-computing platforms **is who pays the bills**. BOINC enjoys a massive pool of externally invested CAPEX and externally financed OPEX. Having this computing power for (almost) free is tempting, but legally there is an important step: whether one has explicitly consented to sponsor such computing. BOINC is clear and ethical on this point, but the same does not apply universally (check your Process Explorer to see what surprising task might have been loaded onto your GPU). – user3666197 Sep 20 '17 at 19:21

2 Answers


BOINC is software that can use the unused CPU and GPU cycles on a computer to do scientific computing.

BOINC is strictly a single application that enables grid computing using unused computation cycles.

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
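To make the MapReduce half concrete, here is a minimal sketch of the programming model in Python (not the actual Hadoop Java API): a "map" phase emits key/value pairs and a "reduce" phase aggregates all values that share a key -- the classic word-count example.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) for every word, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Group by key and sum the counts, as a Hadoop reducer would."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop stores data", "hadoop processes data"]
print(reduce_phase(map_phase(lines)))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, the framework also shuffles and sorts the intermediate pairs across the cluster between these two phases; the sketch above only shows the programming contract.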

(emphasis added to framework and its dual functionality)

Here, you see Hadoop is a framework (also referred to as an ecosystem) that has both storage and computing capabilities. Hadoop vendors such as Cloudera and Hortonworks bundle additional functionality into it (Hive, HBase, Pig, Spark, etc.) as well as a few security/auditing tools.

Additionally, hardware failure is handled differently by these two clusters. If a BOINC node dies, there is no fault tolerance; those resources are lost. In the case of Hadoop, data is replicated and tasks are re-run a certain number of times before eventually failing, and these steps are traceable as long as the logging services built into the framework are running.
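A hedged sketch of that retry behaviour in Python (in real Hadoop this is governed by settings such as `mapreduce.map.maxattempts`, which defaults to 4; the function names here are invented for illustration):

```python
MAX_ATTEMPTS = 4  # mirrors Hadoop's default of 4 attempts per task

def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
    """Re-run a failing task up to max_attempts times before giving up,
    roughly how Hadoop reschedules a failed task on another node."""
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as err:
            last_err = err
            print(f"attempt {attempt} failed: {err}")  # traceable via logs
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_err

# A simulated flaky task: fails twice (as if two nodes died), then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("simulated node failure")
    return "done"

print(run_with_retries(flaky))  # prints "done" on the third attempt
```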

It seems the Large Hadron Collider in Europe is using BOINC; why not Hadoop?

Because BOINC provides software that anyone in the world can install to join the cluster, the project gains a large pool of computing power from anywhere, practically for free.

They might be using Hadoop internally for some storage, and perhaps Spark for additional computing, but buying commodity hardware in bulk and building/maintaining a cluster of that scale would likely be cost-prohibitive.

OneCricketeer

What is similar between BOINC and Hadoop is that both exploit the fact that a big problem can be split into many parts. And both are primarily about distributing work and data across many computers, not about any single application.

The difference is the degree of synchronisation between all contributing machines. With Hadoop the synchronisation is very tight, and at some point you expect all data to be collected from all machines before coming up with the final analysis. You literally wait for the last one; nothing is returned until that last fraction of the job is completed.
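That "wait for the last one" barrier can be sketched with nothing but Python's standard library (this is an illustration of the synchronisation model, not of Hadoop's actual implementation): `pool.map` blocks until every chunk has finished, so the final analysis cannot begin until the slowest worker is done.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def analyse(chunk):
    """Stand-in for a per-machine partial computation; bigger chunks
    are slower, so the last one to finish holds everyone up."""
    time.sleep(0.01 * chunk)
    return chunk * chunk

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() blocks until ALL chunks are done -- the Hadoop-style barrier.
    partials = list(pool.map(analyse, [1, 2, 3, 4]))

print(sum(partials))  # the final analysis, available only now: 30
```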

With BOINC, there is no synchronicity at all. You have many thousands of jobs to be run. The BOINC server side, run by the project maintainers, orchestrates the delivery of jobs to the BOINC clients run by volunteers.

With BOINC, the project maintainers have no control over the clients at all. If a client does not return a result, the work unit is reissued to another client. With Hadoop, the whole cluster is accessible to the project maintainer. With BOINC, the application must be provided for many different platforms, since it is completely uncertain what hardware and operating system a volunteer offers; with Hadoop, everything is well-defined and typically very homogeneous. BOINC's largest projects have many tens of thousands of regular volunteers; Hadoop has whatever you can afford to buy or rent.
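The reissue behaviour described above can be sketched as follows. This is a toy model, not the real BOINC scheduler (which also validates results by sending the same work unit to several volunteers); all class and method names here are invented for illustration.

```python
import collections

class WorkServer:
    """Toy BOINC-style server: hand out work units, reissue any unit
    whose result never comes back from the volunteer it was sent to."""

    def __init__(self, work_units):
        self.pending = collections.deque(work_units)  # not yet handed out
        self.in_flight = {}                           # unit -> client
        self.done = set()

    def assign(self, client):
        """Hand the next pending unit to a volunteer client, if any."""
        if not self.pending:
            return None
        unit = self.pending.popleft()
        self.in_flight[unit] = client
        return unit

    def report(self, unit):
        """A client returned a result for this unit."""
        self.in_flight.pop(unit, None)
        self.done.add(unit)

    def reissue_timed_out(self):
        """Units we never heard back about go back into the queue
        (a real server would only reissue after a per-unit deadline)."""
        for unit in list(self.in_flight):
            self.pending.append(unit)
            del self.in_flight[unit]

server = WorkServer(["wu1", "wu2"])
server.assign("volunteer-a")   # wu1 goes out
server.assign("volunteer-b")   # wu2 goes out
server.report("wu1")           # volunteer-a returns a result
server.reissue_timed_out()     # volunteer-b vanished; wu2 is queued again
server.report(server.assign("volunteer-c"))
print(sorted(server.done))     # ['wu1', 'wu2']
```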

smoe