
Explain the connection between libraries such as Spark SQL, MLlib, GraphX, and Spark Streaming and the core Spark platform

1 Answer

Basically, Spark is the base: an engine for high-performance, large-scale data processing. It provides a programming interface with implicit data parallelism and fault tolerance.
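To make "implicit data parallelism" concrete, here is a toy sketch in plain Python. It is not Spark code and uses no Spark API; the names `split_into_partitions` and `parallel_map_reduce` are hypothetical, made up for this illustration. The idea it mimics is that the caller only supplies a per-element function and a combining function, while the engine decides how to partition the data and run the pieces in parallel. Fault tolerance is not modelled here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a non-empty list into contiguous chunks of roughly equal size."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_map_reduce(data, map_fn, reduce_fn, num_partitions=4):
    """Apply map_fn to every element and combine the results with reduce_fn.

    reduce_fn is assumed to be associative, because partial results are
    computed per partition and then merged.
    """
    if not data:
        raise ValueError("data must not be empty")
    partitions = split_into_partitions(data, num_partitions)

    def process(partition):
        # Work done independently on one partition.
        return reduce(reduce_fn, (map_fn(x) for x in partition))

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process, partitions))
    # Merge the per-partition results into one final value.
    return reduce(reduce_fn, partials)


# The caller never mentions threads or partitions explicitly.
total = parallel_map_reduce(list(range(1, 11)), lambda x: x * x, lambda a, b: a + b)
print(total)  # sum of squares of 1..10
```

In real Spark the partitions live on different machines and a lost partition can be recomputed, but the division of labour between user code and engine is similar in spirit.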

GraphX, MLlib, Spark Streaming and Spark SQL are modules built on top of this engine, and each of them has a different goal. Each library adds new objects and functions that support particular kinds of data structures or features.

For example:

  • GraphX is a distributed graph processing module that lets you represent a graph and apply efficient transformations, partitioning strategies and algorithms specialized for this kind of structure.
  • MLlib is a distributed machine learning module on top of Spark that implements algorithms for tasks such as classification, regression and clustering.
  • Spark SQL introduces the notion of DataFrames, the most important structure in this module, which allow applying SQL-style operations (e.g. select, where, groupBy, ...).
  • Spark Streaming is an extension of core Spark that ingests data in mini-batches and performs transformations on those mini-batches. Spark Streaming has built-in support for consuming from Kafka, Flume, and other platforms.
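The mini-batch idea from the last bullet can be sketched in plain Python. This is only a toy model, not the Spark Streaming API: the helpers `micro_batches` and `word_counts_per_batch` are hypothetical, and batches are cut by record count here to keep the example deterministic, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size records from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def word_counts_per_batch(lines, batch_size):
    """Run the same word-count transformation on every mini-batch."""
    results = []
    for batch in micro_batches(lines, batch_size):
        counts = {}
        for line in batch:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        results.append(counts)
    return results


# A made-up stream of incoming text records.
events = ["spark streaming", "spark sql", "graphx", "spark mllib", "mllib"]
per_batch = word_counts_per_batch(events, batch_size=2)
for counts in per_batch:
    print(counts)
```

Each mini-batch is processed like a small static dataset, which is why the batch-oriented engine underneath can be reused for streaming input.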

Spark ecosystem

You can combine these modules according to your needs. For example, if you want to process a large graph and apply a clustering algorithm to it, you can use the representation provided by GraphX and then use MLlib to apply K-means on this representation.

Doc
