
I am very confused about these two. I know Shark is essentially the same as Hive but up to 100x faster, running on Spark. I want to know the main difference between Spark and Shark. Which is better, meaning faster?

When should I use Spark, and when Shark?

– lucy s

1 Answer

Spark is a framework for distributed data processing; you can write your code in Scala, Java, or Python. Shark was renamed to Spark SQL, and it is an SQL engine on top of Spark: you write SQL queries and they are executed using the Spark framework.

Here's the Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
Here's the Spark SQL guide: https://spark.apache.org/docs/latest/sql-programming-guide.html

So when you write a Spark SQL query, it is converted into Spark code and executed, which means that in general you can write Spark code directly that runs at the same speed as, or faster than, the equivalent Spark SQL query.
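To illustrate, here is a minimal Scala sketch against the Spark 1.x API of that era (the file name and the Person schema are made up for the example); it computes the same count once with plain RDD operations and once through Spark SQL:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical input: a CSV file of "name,age" lines.
    case class Person(name: String, age: Int)

    val sc = new SparkContext(new SparkConf().setAppName("spark-vs-sparksql"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // Spark 1.x implicit: RDD of case classes -> SchemaRDD

    val people = sc.textFile("people.csv")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    // 1) Plain Spark: filter and count with RDD operations.
    val adultsRdd = people.filter(_.age >= 18).count()

    // 2) Spark SQL: register the RDD as a table and run a query;
    //    the query is planned and then executed as Spark jobs underneath.
    people.registerTempTable("people")
    val adultsSql = sqlContext.sql("SELECT COUNT(*) FROM people WHERE age >= 18")
      .collect().head.getLong(0)

Both paths produce the same result; the SQL path just goes through a query planner first.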

– 0x0FFF
  • Spark uses in-memory data, so it is faster than Hadoop. But what happens when the data is in TB? ... – lucy s Nov 21 '14 at 13:24
  • No. Both Spark and Hadoop MapReduce are frameworks for distributed data processing, but they are different. And Hadoop is not only MapReduce; it is a big ecosystem of products based on HDFS, YARN, and MapReduce. The same goes for Spark: you have Spark SQL, Spark Streaming, MLlib, GraphX, and Bagel. The general differences between Spark and MR are that Spark allows fast data sharing by holding all the data in memory by default, and it allows general data-processing graphs. MapReduce is always two steps, Map and Reduce, while in Spark there might be many maps, many reduces, group-bys, joins, etc. – 0x0FFF Nov 21 '14 at 13:28
  • But Hadoop is also evolving: you have Tez, which allows the same custom data-processing paths, you have in-memory caching, etc. The choice of tool should be based on your use case. Spark is faster, but Hadoop at the moment is more mature. – 0x0FFF Nov 21 '14 at 13:30
  • Spark uses in-memory data, so it is faster than Hadoop. But what happens when the data is in TB? Or does Spark use HDFS for storage? – lucy s Nov 21 '14 at 13:53
  • It stores data in memory by default, but see here for the other options: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence - it can also store data on disk, or spill to disk when you don't have enough memory (see the sketch after this list). About terabytes: a single server with 700 GB of RAM is a common thing, and a rack of 40 such servers gives 28 TB of RAM, which is not much at the moment. – 0x0FFF Nov 21 '14 at 14:01
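
As a minimal sketch of the spill-to-disk behavior mentioned in the last comment (the SparkContext `sc` is assumed to already exist, and the HDFS path is made up):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical large input that may not fit in cluster memory.
    val big = sc.textFile("hdfs:///data/big.log")

    // MEMORY_AND_DISK keeps partitions in RAM while they fit and spills
    // the rest to local disk; the default, MEMORY_ONLY, instead recomputes
    // partitions that do not fit.
    big.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes the cache; later actions reuse it.
    val errors = big.filter(_.contains("ERROR")).count()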