3

I am very new to this whole world of "big data" tech, and recently started reading about Spark. One thing that keeps coming up is SparkSQL, yet I consistently fail to comprehend what exactly it is.

Is it supposed to convert SQL queries into MapReduce jobs that operate on the data you give it? But aren't dataframes already essentially SQL tables in terms of functionality?

Or is it some tech that allows you to connect to an SQL database and use Spark to query it? In this case, what's the point of Spark in here at all - why not use SQL directly? Or is the point that you can use your structured SQL data in combination with the flat data?

Again, I am emphasizing that I am very new to all of this and may or may not be talking out of my butt :). So please do correct me and be forgiving if you see that I'm clearly misunderstanding something.

Community

2 Answers

2

Your first guess is essentially correct: it's an API in Spark where you can write queries in SQL and they will be converted to a parallelised Spark job (Spark can do more complex types of operations than just map and reduce). Spark DataFrames are actually just a wrapper around this API; they're an alternative way of accessing it, depending on whether you're more comfortable coding in SQL or in Python/Scala.
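To make the "two front ends, one engine" idea concrete, here is a toy sketch in plain Python. It is not Spark code and does not use Spark's API: `plan_from_sql`, `Frame`, and `execute` are hypothetical names invented for illustration, and the "SQL" it understands is a single hard-coded query shape. It only shows how a SQL string and a chain of method calls can both be turned into the same plan, which a single engine then runs.

```python
import re

# Toy model, NOT Spark code: every name below is hypothetical.
# A "plan" is a tuple: (column_to_return, filter_column, threshold).


def plan_from_sql(query):
    """Front end 1: parse one tiny SQL shape into a plan."""
    m = re.fullmatch(r"SELECT (\w+) FROM (\w+) WHERE (\w+) > (\d+)", query.strip())
    if m is None:
        raise ValueError("unsupported query")
    select_col, _table, filter_col, threshold = m.groups()
    return (select_col, filter_col, int(threshold))


class Frame:
    """Front end 2: a DataFrame-like builder that produces the same kind of plan."""

    def __init__(self, filter_col=None, threshold=None):
        self.filter_col = filter_col
        self.threshold = threshold

    def where(self, col, threshold):
        # Record a "col > threshold" condition and return a new builder.
        return Frame(col, threshold)

    filter = where  # same method under a second name

    def select(self, col):
        # Finish building and hand back the plan.
        return (col, self.filter_col, self.threshold)


def execute(plan, rows):
    """The single 'engine': runs a plan no matter which front end built it."""
    select_col, filter_col, threshold = plan
    return [row[select_col] for row in rows if row[filter_col] > threshold]


people = [
    {"name": "Ann", "age": 41},
    {"name": "Bob", "age": 25},
    {"name": "Cy", "age": 33},
]

sql_plan = plan_from_sql("SELECT name FROM people WHERE age > 30")
api_plan = Frame().where("age", 30).select("name")

print(sql_plan == api_plan)       # True
print(execute(sql_plan, people))  # ['Ann', 'Cy']
```

In real Spark the plan is far richer and the execution is distributed across a cluster, but the division of labour is roughly the same: which syntax you write in is mostly a matter of taste.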

maxymoo
  • I see, makes sense now! I've noticed that Spark seems to have a lot of redundancy to accommodate different styles, like the fact that filter() and where() methods are literally the same method, and the only reason that where() exists is because it's "more familiar" to people who used SQL. But just to be clear, SparkSQL has nothing to do with your regular relational databases like MySQL and Postgres? It's just an API that allows you to make your queries in a very similar manner, but on flat data rather than structured, correct? –  Jan 18 '16 at 04:19
  • I guess this would be its own question, but now that we are on this topic, might as well :). What is the difference between Hive and SparkSQL? I thought Hive was the tool you used to write SQL-like queries on flat data, so is SparkSQL a competitor then? Is it superior? –  Jan 18 '16 at 04:31
  • Yes, this has nothing to do with MySQL and Postgres; it's just about SQL as a query language. Hive also uses SQL syntax, but it runs on Hadoop, which does a lot of disk I/O and so can be pretty slow, whereas Spark is mostly in-memory (if you do it right), so it's much faster for a lot of things. Especially for things like ad-hoc queries on your data, SparkSQL should come back in seconds, whereas Hive might take minutes. – maxymoo Jan 18 '16 at 04:35
0

Spark

Spark is a framework, or a very large set of components, used for scalable, efficient analysis of big data.

For example, people upload a petabyte of video to YouTube every day. Reading one terabyte from a disk at 100 megabytes per second takes about three hours. That's quite a long time, and inexpensive disks cannot help us here. So the challenge we face is that one machine cannot process, or even store, all of the data. The solution is to distribute the data over a cluster of machines.
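The three-hour figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and, for the parallel case, a hypothetical cluster of 100 disks that share the data evenly with no coordination overhead, which is an idealisation.

```python
# Back-of-the-envelope read times, using the figures quoted above.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6
throughput = 100 * MB  # bytes per second for a single disk


def read_hours(num_bytes, disks=1):
    """Hours to read num_bytes split evenly across `disks` disks read in parallel."""
    return num_bytes / (throughput * disks) / 3600


one_disk = read_hours(1 * TB)
# Hypothetical cluster size, ideal parallelism assumed.
hundred_disks = read_hours(1 * TB, disks=100)

print(f"1 TB on 1 disk:    {one_disk:.2f} h")               # 2.78 h
print(f"1 TB on 100 disks: {hundred_disks * 60:.2f} min")   # 1.67 min
```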

DataFrames are the primary abstraction in Spark.

We can construct a DataFrame from text files, JSON files, the Hadoop Distributed File System, Apache Parquet, Hypertable, Amazon S3 files, or Apache HBase, and then perform operations and transformations on it regardless of where the data comes from.

Spark SQL

Spark SQL is a Spark module for structured data processing, as described on the documentation page here.

So one of the attractions of Spark SQL is that it lets us query structured data from many data sources using SQL syntax, and it offers many other possibilities besides. I think this is why we don't just use SQL directly.

Orleando Dassi