
My project is to implement an interactive query for users to explore the data. For example, we have a list of columns the user can choose from; the user adds columns to a list and presses "view data". The data is currently stored in Cassandra, and we use Spark SQL to query it.

The data flow is: raw logs are processed by Spark and stored in Cassandra. The data is a time series with more than 20 columns and 4 metrics. In my tests, putting more than 20 dimensions into the clustering keys makes writes to Cassandra quite slow.

The idea here is to load all the data from Cassandra into Spark and cache it in memory, then provide an API to the client and run queries against the Spark cache. But I don't know how to keep that cached data persistent. I am trying to use spark-job-server, which has a feature called shared objects, but I'm not sure it works.

We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate the data to query at about 100 GB.

What I have already tried:

  • Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data, Spark first reads from Alluxio (more than 1 minute) and then spills to disk (Spark shuffle), which costs another 2–3 minutes. That puts us over our target of under 1 minute. We tested 1 job on 8 CPU cores.
  • Storing the data in MemSQL, but it is rather costly: one day of data costs about 2 GB of RAM, and I'm not sure the speed will hold up as we scale.
  • Querying Cassandra directly, but Cassandra does not support GROUP BY.

So, what I really want to know is: is my direction right, and what can I change to achieve the goal (MySQL-like queries with lots of GROUP BY, SUM, and ORDER BY, returned to the client through an API)?
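For illustration, this is roughly the kind of query we want to serve through the API (table, column, and date values are just examples, not our real schema):

```scala
// Hypothetical cached table "events" with example dimension columns
// (country, device) and example metric columns (impressions, revenue).
val result = sqlContext.sql("""
  SELECT country, device,
         SUM(impressions) AS impressions,
         SUM(revenue)     AS revenue
  FROM events
  WHERE day BETWEEN '2016-04-01' AND '2016-04-30'
  GROUP BY country, device
  ORDER BY impressions DESC
""")
```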


1 Answer


If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.

So, as you are using Spark JobServer, you can create a long-running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because the data will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.

Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
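A minimal sketch of this approach (assuming Spark 1.x and the DataStax spark-cassandra-connector; the host, keyspace, table, and table alias below are placeholders, and with JobServer the context would be the long-running shared one rather than created here):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Build a context once; with JobServer, all jobs submitted to the same
// long-running context share this cache.
val conf = new SparkConf()
  .setAppName("interactive-queries")
  .set("spark.cassandra.connection.host", "cassandra-host") // placeholder host
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the raw data from Cassandra (placeholder keyspace/table names).
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "raw_events"))
  .load()

// Register a temp table and cache it; any job running on the same context
// can now query it via sqlContext.sql(...) without touching Cassandra again.
df.registerTempTable("events")
sqlContext.cacheTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").show() // forces materialization
```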

Daniel de Paula
  • Thanks. What happens if we have multiple requests at the same time? Do they need to wait in a queue to execute? Is there any solution for updating the cached table without downtime? – giaosudau May 18 '16 at 07:01
  • @giaosudau Multiple queries to the same temp table are managed by the Spark cluster scheduler, which you can configure to be FIFO or FAIR (round-robin). To update a cached table, you can do `sqlContext.uncacheTable("name")` and then `newDF.registerTempTable("name")`, followed by a new `sqlContext.cacheTable("name")`. – Daniel de Paula May 18 '16 at 17:18
  • I'm still confused about updating the cached table. If we clear the current one first and then load the new one, that means we have downtime in between (waiting for the new data to load into the cache). I just want to update a small amount of data every hour, and data older than 90 days should be removed from the cache. Is there any way to do that? Thanks. – giaosudau May 19 '16 at 03:15
  • @giaosudau, you don't need to have downtime. You can build a job that queries Cassandra, e.g. `val df = cc.sql("select * from keyspace.cTable where day <= 90")`. Then you run this job every hour, but follow it with `df.persist().count()`; this forces the data to be persisted in memory without affecting your previous table. Then you can recreate your temp table: `cc.uncacheTable("table")` followed by `df.registerTempTable("table")` and `cc.cacheTable("table")`. This will be nearly instantaneous, because `df` is already in memory (see the sketch after these comments). – Daniel de Paula May 19 '16 at 12:10
  • @DanieldePaula Thanks, I was searching for exactly this, but in my case I am not using JobServer. In that case, how can I cache the data and make it available to all other Spark jobs? Can you please help me with this? – Gowtham SB Apr 28 '19 at 07:52
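A sketch of the refresh routine described in the comments above (assuming Spark 1.x with the spark-cassandra-connector; the keyspace, table, cutoff date, and the cached table name "events" are placeholders):

```scala
import org.apache.spark.sql.SQLContext

// Run e.g. hourly: re-read the last 90 days, materialize the new DataFrame
// in memory first, then swap the temp table so queries never see a gap.
def refreshCache(sqlContext: SQLContext): Unit = {
  // Re-read the source from Cassandra (placeholder keyspace/table/filter).
  val fresh = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "analytics", "table" -> "raw_events"))
    .load()
    .filter("day >= '2016-02-18'") // placeholder 90-day cutoff

  // Force the new data into memory before touching the live table.
  fresh.persist()
  fresh.count()

  // Swap: drop the old cached table, then register and cache the new one.
  // Because `fresh` is already materialized, the swap is near-instant.
  // (In a real job you would also unpersist the previous DataFrame to free memory.)
  sqlContext.uncacheTable("events")
  fresh.registerTempTable("events")
  sqlContext.cacheTable("events")
}
```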