Some basic questions regarding Spark. Can we use Spark only in the context of processing jobs? In our use case we have a stream of position and motion data which we refine and save to Cassandra tables. That is done with Kafka and Spark Streaming. But for a web user who wants to view a report with some search criteria, can we use Spark (Spark SQL)? Or should we restrict ourselves to CQL for this purpose? If we can use Spark, how can we invoke Spark SQL from a web service deployed in a Tomcat server?
-
Take a look at https://github.com/spark-jobserver/spark-jobserver#features. This could be used from your UI. I think this supports Spark SQL. – satish Jul 18 '16 at 16:59
-
Just now I noticed that spark-jobserver is integrated in DataStax Enterprise 4.8. Big data and Cassandra are already present in many production-level systems, so I am curious how they integrate such query-related services in production. Is it using Spark? Or do they use Spark jobs to create and update query-related data in many tables, and then query those tables directly from the web application with CQL and the DataStax Java driver? The advantage of Spark SQL is there when we have to join tables to get data. Which is the best method? – Krishna Kumari Jul 19 '16 at 04:02
1 Answer
Well, you can do it by passing the query as part of an HTTP request, for example:
http://yourwebsite.com/Requests?query=WOMAN
At the receiving point, the architecture will be something like:
Tomcat+Servlet --> Apache Kafka/Flume --> Spark Streaming --> Spark SQL inside a SS closure
In the servlet (if you don't know what a servlet is, look it up) inside the web application folder of your Tomcat, you will have something like this:
import java.util.Properties;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QueryServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) {
        // e.g. for ?query=WOMAN: requestChoice = "query", requestArgument = "WOMAN"
        String requestChoice = request.getQueryString().split("=")[0];
        String requestArgument = request.getQueryString().split("=")[1];

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("acks", "all");
        properties.setProperty("retries", "0");
        properties.setProperty("batch.size", "16384");
        properties.setProperty("auto.commit.interval.ms", "1000");
        properties.setProperty("linger.ms", "0");
        properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.setProperty("block.on.buffer.full", "true");

        // The topic is the request type, the message value is the query argument
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        producer.send(new ProducerRecord<String, String>(requestChoice, requestArgument));
        producer.close();
    }
}
In the running Spark Streaming application (which needs to be up already in order to catch the queries; otherwise you know how long it takes Spark to start), you need to have a Kafka receiver:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(batchInt * 1000));

// Subscribe to the topic the servlet produces to (here "wearable"), with one receiver thread
Map<String, Integer> topicMap = new HashMap<>();
topicMap.put("wearable", 1);

// Each record in the DStream is a (key, value) pair read from the topic
JavaPairReceiverInputDStream<String, String> kafkaStream =
        KafkaUtils.createStream(jssc, "localhost:2181", "test", topicMap);
After this, what happens is that:
- You do a GET, passing the argument of the query either in the query string or in the request body
- The GET is caught by your servlet, which immediately creates, sends with, and closes a Kafka producer (it is possible to avoid the Kafka step altogether and send the information to your Spark Streaming app in any other way; see Spark Streaming receivers)
- Spark Streaming runs your Spark SQL code as in any other submitted Spark application, but it keeps running, waiting for further queries to come (see the sketch after this list)
Of course, in the servlet you should check the validity of the request, but this is the main idea, or at least the architecture I've been using.
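For example, the "Spark SQL inside a SS closure" step could be sketched like this. This is only a minimal, untested sketch: it assumes Spark 1.6+ (the VoidFunction overload of foreachRDD and the DataFrame API) and that your refined position data has already been registered as a temporary table named positions; both of those are placeholders, not part of the code above.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import scala.Tuple2;

kafkaStream.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
    @Override
    public void call(JavaPairRDD<String, String> rdd) {
        // Each record is a (key, value) pair as produced by the servlet above
        for (Tuple2<String, String> query : rdd.collect()) {
            String argument = query._2();
            // Reuse one SQLContext per JVM instead of creating a new one every batch
            SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
            DataFrame result = sqlContext.sql(
                    "SELECT * FROM positions WHERE category = '" + argument + "'");
            result.show(); // or write the result somewhere the web tier can read it from
        }
    }
});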

Vale
-
We can do Spark SQL by integrating with Kafka and streaming; that is OK. But is there any alternative, like sharing the SparkContext with a web application? In case we use Kafka, do we have to create a topic for each query, or keep a single topic with a generic request/response object with a type to identify the query? Performance-wise, which will be better? – Krishna Kumari Jul 19 '16 at 03:28
-
Unfortunately the SparkContext is not serializable. I have never tried to build a servlet that, as its first thing, starts a Spark context and then keeps it active. Maybe using a static block at the beginning of the class could help. As for Kafka, no: all the queries should go to a single topic. If you decide to use multiple topics, they should direct to different applications, as topics are divided either by workload or by logic. Furthermore, if you created a different topic each time, you would have to make your consumer subscribe to each one of them each time, which is impossible. – Vale Jul 19 '16 at 13:07
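As a rough illustration of the "static block" suggestion in the comment above (a minimal, untested sketch; the class name, the positions table, and the local master setting are placeholders, not anything from the answer itself), the servlet could keep one long-lived context shared by all requests:

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkQueryServlet extends HttpServlet {

    // Created once when the class is loaded, reused by every request
    private static final JavaSparkContext SPARK;
    private static final SQLContext SQL;

    static {
        SparkConf conf = new SparkConf()
                .setAppName("web-queries")
                .setMaster("local[*]"); // placeholder; point this at your cluster instead
        SPARK = new JavaSparkContext(conf);
        SQL = new SQLContext(SPARK);
    }

    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
        String argument = request.getParameter("query");
        // Same caveat as in the answer: validate/escape the argument before using it in SQL
        DataFrame result = SQL.sql("SELECT * FROM positions WHERE category = '" + argument + "'");
        response.getWriter().println(result.count() + " rows matched");
    }
}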