I am trying to use Spark Structured Streaming to collect data from CSV files located on NFS. The code is very simple, and so far I have only been running it in spark-shell, but even there I am running into some issues.
I am running spark-shell against a standalone Spark master with 6 workers, passing the following arguments:
--master spark://master.host:7077 --num-executors 3 --conf spark.cores.max=10
This is the code:
// infer the schema from a one-off batch read of the schema file
val schema = spark.read.option("header", true).option("mode", "PERMISSIVE").csv("/nfs/files_to_collect/schema/schema.csv").schema
// stream any CSV files matching the glob, using that schema
val data = spark.readStream.option("header", true).schema(schema).csv("/nfs/files_to_collect/jobs/jobs*")
// print each micro-batch to the console
val query = data.writeStream.format("console").start()
There are 2 files in that NFS path, each about 200MB in size. When I start the writeStream query, I get the following warning:
"17/11/13 22:56:31 WARN TaskSetManager: Stage 2 contains a task of very large size (106402 KB). The maximum recommended task size is 100 KB."
Looking at the Spark master UI, I see that only one executor was used: four tasks were created, each reading about half of one of the CSV files.
My questions are:
1) The more files there are in the NFS path, the more memory the driver seems to need: with 2 files it would crash until I increased the driver memory to 2g, and with 4 files it needs no less than 8g (the spark-shell invocation I use to raise it is shown after this list). What is the driver doing that it needs so much memory?
2) How do I control the parallelism of reading the CSV files? I noticed that the more files there are, the more tasks are created, but is it possible to control this manually? The sketch after this list shows the knobs I have been guessing at.
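Regarding question 1, this is how I have been raising the driver memory; I am assuming --driver-memory is the right flag for a standalone-mode spark-shell, and the rest of the invocation is unchanged from above:

spark-shell --master spark://master.host:7077 --num-executors 3 \
  --conf spark.cores.max=10 --driver-memory 8g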
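Regarding question 2, these are the knobs I have been experimenting with, building on the snippet above. I am not sure any of them is the intended way to control the read parallelism, so please treat this as a sketch of what I mean rather than something I know to work:

// split each file into smaller partitions; the 128MB default would explain
// why I see 4 tasks for 2 x ~200MB files
spark.conf.set("spark.sql.files.maxPartitionBytes", 33554432L)  // 32MB

// limit how many files go into each micro-batch
val data = spark.readStream.option("header", true).option("maxFilesPerTrigger", 1).schema(schema).csv("/nfs/files_to_collect/jobs/jobs*")

// or repartition the streaming DataFrame explicitly so the work is shuffled across more executors
val query = data.repartition(10).writeStream.format("console").start()

Is one of these the right approach, or is there a better way?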