
I have a task wherein I receive a list of files (each file is very small) that should be processed. There are millions of such files stored in my AWS S3 bucket, and I need to filter and process only those files which are present in the above list.

Can anyone please let me know the best practice to do this in Spark?

E.g., there are millions of files in the AWS S3 bucket of XYZ university, each with a unique ID as its filename. I get a list of 1000 unique IDs to be processed. Now I have to process only these files, aggregate the results, and generate an output CSV file.

Kashif Hamad
  • Possible duplicate of [How to read multiple text files into a single RDD?](https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd) – Xavier Guihot Mar 06 '18 at 06:45

2 Answers


After browsing for quite some time, I understood the following:

    String[] list = {"s3a://path1/file1", "s3a://path1/file2", ...};
    Dataset<Row> readDF = spark.read().json(list);   // DataFrameReader.json(String... paths)

This reads the objects from S3 treating it like an HDFS file system, which hurts performance. If S3 is the source, the following code improves performance considerably: use the AWS SDK to fetch the objects from S3 directly and then build an RDD from their contents.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Arrays;
    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
    import com.amazonaws.services.s3.AmazonS3Client;

    String[] list = {"file1", "file2", ...};
    JavaRDD<String> readRDD = sc.parallelize(Arrays.asList(list))
            .map(file -> {
                // Fetch each (small) object directly with the AWS SDK; "file" is the object key.
                AmazonS3Client s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(s3client.getObject("Bucket-Name", file).getObjectContent()));
                // Concatenate the object's lines into a single string.
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line);
                }
                reader.close();
                return sb.toString();
            });
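
To get from the fetched JSON strings to the aggregated CSV the question asks for, a minimal follow-up sketch (not part of the original answer) could look like the following, assuming Spark 2.2+ with a `SparkSession` named `spark`; the `category` column and the output path are made-up placeholders:

    Dataset<Row> df = spark.read()
            .json(spark.createDataset(readRDD.rdd(), Encoders.STRING()));

    // Hypothetical aggregation: count records per "category" column.
    Dataset<Row> summary = df.groupBy("category").count();

    summary.coalesce(1)                       // collapse to a single output file
           .write()
           .option("header", "true")
           .mode(SaveMode.Overwrite)
           .csv("s3a://output-bucket/summary");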
Kashif Hamad

Supply a comma-delimited list of the paths to the files, e.g. if these are JSON files:

    spark.read.json("s3a://path1/file1", "s3a://path1/file2", ...)
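
For completeness, a hedged Java sketch of the same idea, assuming the 1000 IDs arrive as a `List<String>`; the variable names and bucket path below are illustrative, not from the answer:

    List<String> ids = Arrays.asList("id1", "id2" /* ... the 1000 IDs ... */);
    String[] paths = ids.stream()
            .map(id -> "s3a://xyz-university-bucket/" + id)
            .toArray(String[]::new);

    // DataFrameReader.json(String... paths) accepts the whole array in one call.
    Dataset<Row> df = spark.read().json(paths);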
Arnon Rotem-Gal-Oz
  • Would this be a good solution performance-wise if the list of files increases from 1000 to 1M? – Kashif Hamad Mar 06 '18 at 12:55
  • If you have a large list you can use glob syntax (if there's a pattern). If the number of files requested is close to all of the files, you might read everything and filter the data later; otherwise you need to open all the files anyway. – Arnon Rotem-Gal-Oz Mar 06 '18 at 14:09
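
As an illustration of the glob idea from the last comment (the pattern and bucket name below are assumptions, not from the thread), a wildcard in the path is expanded by the underlying Hadoop FileSystem layer when the files are listed:

    // Read every JSON object whose key matches the date-based pattern.
    Dataset<Row> df = spark.read().json("s3a://xyz-university-bucket/2018-03-*/*.json");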