My project is implement a interaction query for user to discover that data. Like we have a list of columns user can choose then user add to list and press view data. The current data store in Cassandra and we use Spark SQL to query from it.
The Data Flow is we have a raw log after be processed by Spark store into Cassandra. The data is time series with more than 20 columns and 4 metrics. Currently I tested because more than 20 dimensions into cluster keys so write to Cassandra is quite slow.
The idea here is load all data from Cassandra into Spark and cache it in memory. Provide a API to client and run query base on Spark Cache. But I don't know how to keep that cached data persist. I am try to use spark-job-server they have feature call share object. But not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB RAM. We estimate data to query is about 100 GB.
What I have already tried:
- Try to store in Alluxio and load to Spark from that but the time to load is slow because when it load 4GB data Spark need to do 2 things first is read from Alluxio take more than 1 minutes and then store into disk (Spark Shuffle) cost more than 2 or 3 minutes. That mean is over the time we target under 1 minute. We tested 1 job in 8 CPU cores.
- Try to store in MemSQL but kind of costly. 1 days it cost 2GB RAM. Not sure the speed is keeping good when we scale.
- Try to use Cassandra but Cassandra does not support GROUP BY.
So, what I really want to know is my direction is right or not? What I can change to archive the goal (query like MySQL with a lot of group by, SUM, ORDER BY) return to client by a API.