I have a bucket that holds a massive amount of data, and I want to get only the objects (files) whose key contains a specific string (a UUID that is part of the file name).

Right now I list all the objects in the bucket, keep only the summaries whose key contains that string, collect them in a list, and return the list of matching files.

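import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

// s3client and bucketName are assumed to be fields of the enclosing class.
public List<String> getBucketList(String filterStr) {

        List<String> lst = new ArrayList<>();
        try {
            ListObjectsRequest listObjectsRequest =
                    new ListObjectsRequest()
                            .withBucketName(bucketName);
            ObjectListing objects = s3client.listObjects(listObjectsRequest);
            for (;;) {
                List<S3ObjectSummary> summaries = objects.getObjectSummaries();
                if (summaries.size() < 1) {
                    break;
                }
                // Keep only the keys that contain the filter string.
                for (S3ObjectSummary summary : summaries) {
                    if (summary.getKey().contains(filterStr)) {
                        lst.add(summary.getKey());
                    }
                }
                // Fetch the next page; it is empty once the listing is exhausted.
                objects = s3client.listNextBatchOfObjects(objects);
            }
        } catch (AmazonClientException e) {
            // Error handling is not shown in the original snippet; rethrow so the caller sees the failure.
            throw e;
        }
        return lst;
}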

Expected: the listing returns only the objects relevant to me, i.e. those whose key contains 'filterStr' (a variable whose value is the UUID).

Actual: I fetch all the objects and then filter them by checking whether the key contains filterStr. This eventually does what I intended, but it takes a lot of time, and I wonder whether I can reduce it.

EDIT: Inside my bucket I have multiple folders, for example:

alert
alert_archived
channel
device

Inside each folder, objects are organized by date (year, month, day), which is shown like this:

alert 2019 08 26

An example of a file that I want to get follows this convention:

s3://<bucket_name>/<name_of_folder_out_of_many>/2019/08/25/<UUID>_<name_of_the_file>.csv.gz

I want to iterate over all the folders in the bucket and get only the files whose name has the form <UUID>_<name_of_the_file>.csv.gz for this specific UUID. The date matters too: I want only the files for the current date.
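
Given this key convention, one possible way to narrow the listing is to build a key prefix per folder and let S3 filter on it, instead of listing the whole bucket. The following is only a sketch under assumptions: `S3KeyPrefixes` and `prefixesFor` are hypothetical names that are not part of the AWS SDK, the folder names are the ones listed above, and keys are assumed to start directly with the folder name followed by `yyyy/MM/dd`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the AWS SDK): builds one key prefix per
// top-level folder for a given date and UUID.
final class S3KeyPrefixes {

    // Assumed to match the <yyyy>/<MM>/<dd> layout described in the question.
    private static final DateTimeFormatter DATE_PATH =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    private S3KeyPrefixes() {}

    // Returns prefixes of the form <folder>/<yyyy>/<MM>/<dd>/<uuid>_
    static List<String> prefixesFor(List<String> folders, LocalDate date, String uuid) {
        List<String> prefixes = new ArrayList<>();
        for (String folder : folders) {
            prefixes.add(folder + "/" + date.format(DATE_PATH) + "/" + uuid + "_");
        }
        return prefixes;
    }
}
```

Each prefix could then be passed to `new ListObjectsRequest().withBucketName(bucketName).withPrefix(prefix)`, so that S3 returns only keys starting with that prefix and the `contains` check over the whole bucket is no longer needed. This means one listing call per folder, which should stay cheap as long as the number of folders is small.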

tupac shakur
  • Check here https://stackoverflow.com/questions/4979218/how-do-you-search-an-amazon-s3-bucket/21836343#21836343 – muasif80 Aug 25 '19 at 15:17
  • I'm confused. Are you seeking objects with a particular thing in their filename (Key), or do you wish to search _inside_ the objects? What is the format of the objects (eg CSV)? – John Rotenstein Aug 26 '19 at 04:14
  • EDIT: I will edit my question @JohnRotenstein – tupac shakur Aug 26 '19 at 06:34
  • So, you just want to get a listing of objects that match that particular pattern? Are you doing this on a regular basis? Have you considered using [Amazon S3 Inventory - Amazon Simple Storage Service](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html) to obtain a regular (eg daily) listing of objects? – John Rotenstein Aug 26 '19 at 06:43
  • We are using an S3 bucket to store each folder (we are not using the Amazon S3 Inventory service). It is a simple S3 bucket organized by the convention I mentioned above. And yes, all I want to do is get a list of those objects @JohnRotenstein – tupac shakur Aug 26 '19 at 07:03
  • I suggest you take a look at S3 Inventory, which can give you a daily list of all files, without having to make API calls. – John Rotenstein Aug 26 '19 at 07:30
  • I would use it if I were querying the list every time and checking MD5 to see changes, but this is not the case. I would like to know whether I can modify my code to make it more efficient, or whether there is no other way of doing it than the way I did. – tupac shakur Aug 26 '19 at 08:03