
I'm building a TFX pipeline that takes images from an S3 bucket as input. At the TF Transform component step, I'm attempting to read in a series of images whose URLs are stored in TFX's SparseTensor format. I'm trying to use the S3FS Python module to do so, since I've been using it for other components of my pipeline and have heard that using Boto3 and S3FS together can cause issues (this is beside the point, I think).

Anyway, I've established a connection to the S3 bucket and am attempting to read in images. Here is my code (or at least the part of it I think is germane to the issue):

  s3 = s3fs.S3FileSystem()

  with s3.open(str(inputs[key]), 'rb') as f:
    for key in CV_FEATURES:
      img = np.array(Image.open(io.BytesIO(f.read())))
      img = tf.image.rgb_to_grayscale(img)
      img = tf.divide(img, 255)
      img = tf.image.resize_with_pad(img, 224, 224)
      outputs[_fill_in_missing(key)] = img

  s3.clear_instance_cache()

Running this gives me the standard error message I've seen for trying to access buckets with invalid characters:

ParamValidationError: Parameter validation failed: Invalid bucket name "SparseTensor(indices=Tensor("inputs": Bucket name must match the regex "^[a-zA-Z0-9.-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).:(s3|s3-object-lambda):[a-z-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-]{1,63}$|^arn:(aws).:s3-outposts:[a-z-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9-]{1,63}$"

The error points to the line "with s3.open(str(inputs[key]), 'rb') as f:", so somehow I need to represent the S3 URL correctly. The URLs are stored in the format bucket_name\key\file.jpg in a column called image_path in the original CSV dataset (converted to a SparseTensor before this point and represented in the above code as inputs[key]).

I don't think the issue is with the SparseTensor format, but rather the URL.
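Both possibilities (the SparseTensor and the URL format) can be checked without touching S3. The following is a minimal sketch, not the actual S3FS or botocore code: it assumes the path is split into bucket and key at the first forward slash, and it uses the bucket-name pattern quoted in the error message with the hyphen escaped. The helpers split_bucket_key, is_valid_bucket and to_s3_path are hypothetical names introduced only for illustration.

```python
import re

# Bucket-name pattern quoted in the error message. The hyphen is escaped here,
# on the assumption that the backslash was lost when the message was pasted.
BUCKET_RE = re.compile(r"^[a-zA-Z0-9.\-_]{1,255}$")


def split_bucket_key(path):
    """Hypothetical helper: treat everything before the first '/' as the bucket."""
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    bucket, _, key = path.partition("/")
    return bucket, key


def is_valid_bucket(name):
    """Return True if the name matches the bucket-name pattern above."""
    return BUCKET_RE.match(name) is not None


def to_s3_path(raw):
    """Hypothetical helper: convert backslash separators to forward slashes."""
    return raw.replace("\\", "/")


candidates = {
    # The "bucket name" exactly as it appears in the error message.
    "tensor repr": 'SparseTensor(indices=Tensor("inputs',
    # The path format stored in the image_path column.
    "backslash path": r"bucket_name\key\file.jpg",
    # The same path with forward slashes.
    "normalized path": to_s3_path(r"bucket_name\key\file.jpg"),
}

for label, path in candidates.items():
    bucket, key = split_bucket_key(path)
    print(f"{label}: bucket={bucket!r} key={key!r} valid={is_valid_bucket(bucket)}")
```

Under these assumptions, only the normalized path yields a bucket name that passes the pattern. The bucket name in the error message is the start of a SparseTensor's string representation, which suggests that str(inputs[key]) is being passed as the path rather than the URL stored inside the tensor.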

Andrew Gaul
John Sukup
  • What does print(str(inputs[key])) look like? Is it 'bucket_name\key\file.jpg'? – Jonathan Leon Jun 06 '21 at 01:50
  • I tried that out. Because my "with" statement comes before the for-loop that references the column I'm trying to pull the URL from, it looks like it's pulling in the wrong column. This makes me think I should swap the "with" and "for" statements, but when I do, I get a "KeyError: 'i'". – John Sukup Jun 08 '21 at 02:07
  • What does "for key in CV_FEATURES: print(key)" get you? If key is of the form 'bucket_name\key\file.jpg', maybe you don't need str(inputs[key]); you just need key. – Jonathan Leon Jun 08 '21 at 03:37
  • Strangely enough, during my trials on this, sometimes I'd get a "KeyError: 'i'" which I couldn't figure out. After trying your suggestion, it's because the "key" getting printed and causing the error is the first character of the column I'm trying to use in the Transform: "image_path". I think this may have to do with the way TFX works with SparseTensors/Tensors. I'll report back... – John Sukup Jun 10 '21 at 01:05
  • Can you take a look at this [link](https://stackoverflow.com/a/62349388/11530462), which discusses a similar problem, and let us know if that helps? Thanks! –  Mar 21 '22 at 16:50
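
The "KeyError: 'i'" described in the comments is consistent with looping over a plain string instead of a list of column names, because iterating a Python string yields its characters one at a time. The sketch below illustrates that behaviour only. It assumes CV_FEATURES was defined as the bare string "image_path", which the original post does not confirm, and it uses an ordinary dict as a stand-in for the TFX inputs.

```python
# Stand-in for the Transform inputs dict (assumption: a plain dict, not tensors).
inputs = {"image_path": "bucket_name/key/file.jpg"}

# Two hypothetical ways CV_FEATURES might have been defined.
CV_FEATURES_AS_STRING = "image_path"    # iterating yields 'i', 'm', 'a', ...
CV_FEATURES_AS_LIST = ["image_path"]    # iterating yields 'image_path'


def first_key(features):
    """Return the first item produced by iterating over features."""
    return next(iter(features))


try:
    inputs[first_key(CV_FEATURES_AS_STRING)]
except KeyError as err:
    # The lookup fails on the first character of the column name.
    print("KeyError:", err)

# With a list, the first key is the full column name and the lookup succeeds.
print(inputs[first_key(CV_FEATURES_AS_LIST)])
```

If that assumption holds, wrapping the column name in a list (or tuple) makes the loop variable the full column name rather than its first character.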

0 Answers