I have a task in which I receive a list of files (each file is very small) that need to be processed. There are millions of such files stored in my AWS S3 bucket, and I need to filter and process only the files that appear in this list.
Can anyone please let me know the best practice for doing this in Spark?
E.g., there are millions of files in the AWS S3 bucket of XYZ university, each with a unique ID as its filename. I receive a list of 1000 unique IDs to process. I then have to process only these files, aggregate the results, and generate an output CSV file.
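For context, here is a rough sketch of what I have in mind in PySpark. The bucket paths, the `.csv` extension, and the `score` column are placeholders for my real setup, and I don't know whether passing explicit per-file paths like this is the right approach at this scale:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("process-listed-files").getOrCreate()

# Placeholder bucket and prefix -- substitute your own.
bucket = "s3a://xyz-university-bucket/records"

# In practice this would be the list of 1000 IDs I receive.
ids_to_process = ["id-0001", "id-0002", "id-0003"]

# Build the exact per-file paths so Spark reads only these files
# instead of listing the millions of objects in the bucket.
paths = [f"{bucket}/{file_id}.csv" for file_id in ids_to_process]

# DataFrameReader.csv accepts a list of paths.
df = spark.read.csv(paths, header=True, inferSchema=True)

# Hypothetical aggregation -- 'score' is a placeholder column name.
result = df.agg(F.avg("score").alias("avg_score"))

# coalesce(1) so the output is a single CSV part file.
result.coalesce(1).write.mode("overwrite").csv(
    "s3a://my-output-bucket/result", header=True
)
```

Is this the recommended way, or is there a better pattern (e.g., something that handles missing files or avoids driver-side path construction) for this kind of workload?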