Say I have data (user events) stored in a distributed file system like S3 or HDFS. The user events are stored in one directory per date.
Case 1: Consider a Spark job that needs to read the data for a single day. My understanding is that a single Spark job will read the data from that day's directory block by block and provide the data to the Spark cluster for computation. Will that block-by-block reading be sequential?
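
For concreteness, here is a minimal sketch of what I mean by Case 1, assuming the events are JSON files under a hypothetical layout s3a://my-bucket/events/<yyyy-MM-dd>/ (bucket name, format, and paths are just placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-single-day")
      .getOrCreate()

    // Read one day's directory. My understanding is that Spark splits the
    // files in this directory into input partitions (roughly one per
    // block/split) and schedules one read task per partition on the executors.
    val oneDay = spark.read.json("s3a://my-bucket/events/2024-01-01/")

    // Number of read tasks that would be created for this directory.
    println(oneDay.rdd.getNumPartitions)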
Case 2: Consider a Spark job that needs to read the data for more than one day (say 2 days). Here the job has to read the data from two separate directories. Do I need to start two separate Spark processes (or threads) so that the reads from the separate directories can run in parallel?
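
For reference, this is roughly how I would express the two-day read as a single job, using the same hypothetical paths as above; my question is whether this one call is enough for the two directories to be read in parallel, or whether I need separate Spark processes/threads:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-two-days")
      .getOrCreate()

    // DataFrameReader.json (like .parquet, .csv, ...) accepts multiple paths,
    // so both day directories can be passed to a single read call.
    val twoDays = spark.read.json(
      "s3a://my-bucket/events/2024-01-01/",
      "s3a://my-bucket/events/2024-01-02/"
    )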