I have an application that uses Stocator as its connector for Spark. The application writes its data to an S3 COS bucket.

Now I am working on a service that is supposed to read that data from S3. According to this thread here, you cannot specify the URI/protocol that boto3 uses. Is it safe to read that data using the default protocol of the S3 REST API?

The reason I am asking is that I have been told that using S3A (another protocol) to read data that was written with Stocator could result in reading duplicates.

OneCricketeer
  • 2
    If S3A results in duplicates, that's because it was written with duplicates... boto3 would not dedupe the data, either. Also `s3a://` is only a JVM hadoop-aws S3 SDK protocol, not boto, and the linked answer is 6 years old – OneCricketeer Jan 10 '23 at 01:00
  • @OneCricketeer thanks for your comment. Can you elaborate more on how S3A is written with duplicates? – BovineScatologist Jan 10 '23 at 13:29
  • 1
    I haven't reviewed the project you've linked to, but when you say "you've been told reading data written using Stocator could result in duplicates", then that is simply because **duplicates were written**. Boto3 wouldn't remove duplicated data when reading. That is not related to the "protocol", so I'm not sure why you're trying to find an "alternative one" – OneCricketeer Jan 10 '23 at 17:06

0 Answers