
Say I have data (user events) stored in a distributed file system like S3 or HDFS. The user events are stored in one directory per date.

Case 1: Consider that a Spark job needs to read the data for a single day. My understanding is that a single Spark job will read the data from that day's directory block by block and provide the data to the Spark cluster for computation. Will that block-by-block reading process be sequential?

Case 2: Consider that a Spark job needs to read the data for more than one day (say 2 days). Here the job has to read the data from two separate directories. Do I need to start two separate Spark processes (or threads) so that the reads from the separate directories can run in parallel?
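To make the two cases concrete, here is a minimal sketch of what I mean (the bucket name, paths, and JSON format are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-user-events").getOrCreate()

# Case 1: read a single day's directory
one_day = spark.read.json("s3://my-bucket/events/2020-01-11/")

# Case 2: read two days' directories -- do I need two separate
# jobs/threads, or can a single read call cover both directories?
two_days = spark.read.json([
    "s3://my-bucket/events/2020-01-11/",
    "s3://my-bucket/events/2020-01-12/",
])
```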

user3198603
  • Does it matter how it reads? You can't assume a sequential read. – Salim Jan 12 '20 at 14:22
  • @Salim I think it matters. With a parallel read, data reading will be much faster. See my updated post. – user3198603 Jan 13 '20 at 02:21
  • You can achieve this by bucketing and partitioning the data while saving it. Also use the Parquet file format. Spark will apply partition pruning and predicate pushdown to reduce the amount of data being read for a query. – Salim Jan 13 '20 at 13:37

1 Answer


You can achieve this by bucketing and partitioning the data while saving it. Also use the Parquet file format, which is columnar. Spark will apply partition pruning and predicate pushdown to reduce the amount of data being read for a query. Using multiple executors along with multiple partitions will help parallelize the processing of the data.
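A minimal sketch of that idea, assuming an `events` DataFrame with an `event_date` column and a hypothetical `user_id` bucketing key (note that `bucketBy` requires saving as a table rather than to a plain path):

```python
# Write the data partitioned by date and bucketed by user_id, as Parquet.
(events.write
    .partitionBy("event_date")   # one sub-directory per date -> enables partition pruning
    .bucketBy(8, "user_id")      # bucketing requires saveAsTable, not save(path)
    .sortBy("user_id")
    .format("parquet")
    .saveAsTable("user_events"))

# A filter on the partition column only touches the matching date directories;
# filters on other columns are pushed down to the Parquet reader.
spark.table("user_events").where("event_date = '2020-01-11'").show()
```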

Salim
  • When you say multiple executors, does it mean two separate threads in the same job or two separate instances altogether? – user3198603 Jan 17 '20 at 09:00
  • Both cores and nodes are considered. The cluster defines how many executors are available and how many cores each executor takes. A job is submitted with a demand for n number of executors; based on that, resources are assigned, which determines how fast your code runs. Use `--total-executor-cores` (see the sketch after these comments). If you like the answer you can upvote and accept it. – Salim Jan 17 '20 at 14:59
  • Thanks for the vote! It takes time to write answers – Salim Jan 18 '20 at 02:26
  • When you say `A job is submitted with a demand for n number of executors`, I believe the user submitting the job has to explicitly specify the number of executors (workers)? – user3198603 Jan 18 '20 at 03:41
  • Yes, mention executors while submitting a job. On a Spark standalone cluster it takes all executors by default, so you need not mention it, but on YARN you do need to specify it. – Salim Jan 18 '20 at 03:49
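For reference, a sketch of how executor resources can be requested. The submit-time flags (`--total-executor-cores` for standalone, `--num-executors`/`--executor-cores` for YARN) and config keys are real Spark options; the app name and file name are made up:

```python
# Submit-time resource demands (shown here as comments):
#   Standalone:  spark-submit --total-executor-cores 8 my_job.py
#   YARN:        spark-submit --master yarn --num-executors 4 --executor-cores 2 my_job.py
#
# The same settings can be made in code before the session is created:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("user-events-job")
         .config("spark.executor.instances", "4")  # number of executors (YARN)
         .config("spark.executor.cores", "2")      # cores per executor
         .getOrCreate())
```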