I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
- Sync with the main remote database, it makes a dump of some of the server's tables, reduce and aggregates the data, save the result as a parquet file and cache it as a sql table in memory. This operation will be done every day;
- Queries, when the sync operation is finished, users can perform SQL complex queries on the aggregated dataset, (eventually) exporting the result as csv file. Every user can do only one query at time, and wait for its completion.
The biggest table (before and after the reduction, which include also some joins) has almost 30M of rows, with at least 30 fields.
Actually I'm working on a dev machine with 32GB of ram dedicated to the job server, and everything runs smoothly. Problem is that in the production one we have the same amount of ram shared with a PredictionIO server.
I'm asking how determine the memory configuration to avoid memory leaks or crashes for spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you