
I have a use case where I need to build an S3 navigator that lets users browse S3 files and view them without any sort of AWS access, so users don't need AWS credentials configured on their systems.

The approach I tried is a Python app built with tkinter, with access to S3 through an API Gateway proxy to S3 (set up per the AWS docs). This works fine for txt files in S3, but reading Feather files fails:

```
s3_data=pd.read_feather("https://<api_gateway>/final/s3?key=naxi143/data.feather")
  File "C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\io\feather_format.py", line 130, in read_feather
    return feather.read_feather(
  File "C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pyarrow\feather.py", line 218, in read_feather
    return (read_table(source, columns=columns, memory_map=memory_map)
  File "C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pyarrow\feather.py", line 239, in read_table
    reader = _feather.FeatherReader(source, use_memory_map=memory_map)
  File "pyarrow\_feather.pyx", line 75, in pyarrow._feather.FeatherReader.__cinit__
  File "pyarrow\error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 114, in pyarrow.lib.check_status
OSError: Verification of flatbuffer-encoded Footer failed.
```

I'm not sure if some settings are misconfigured on the API Gateway side.


Is there any other way to make this work without involving AWS credentials?
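One thing worth trying, as a sketch (the URL is a placeholder and the function names are illustrative, not from the original setup): fetch the file as raw bytes with an explicit binary `Accept` header and hand pandas an in-memory buffer, instead of passing the URL straight to `pd.read_feather`. API Gateway can mangle responses it treats as text, which would corrupt the Feather footer.

```python
import io
import urllib.request

def build_request(url):
    # Ask API Gateway for the object as opaque binary data; without a binary
    # content-type/Accept configuration it may transform the payload and
    # corrupt the Feather footer.
    return urllib.request.Request(
        url, headers={"Accept": "application/octet-stream"}
    )

def read_feather_via_gateway(url):
    # Assumes pandas + pyarrow are installed; buffers the whole response
    # in memory, which is fine for moderately sized files.
    import pandas as pd
    with urllib.request.urlopen(build_request(url)) as resp:
        return pd.read_feather(io.BytesIO(resp.read()))
```

`pd.read_feather` accepts a file-like object, so reading from a `BytesIO` buffer sidesteps any URL handling inside pandas itself.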

**Update**

It looks like API Gateway has a payload limit of 10 MB, which rules this solution out for me, since most of my files are larger than that. Is there any other way to achieve the same result without using AWS credentials?

Naxi
  • Please show the traceback you get for that exception, not just the error message. It sounds like your data is getting truncated, perhaps. – AKX Aug 12 '21 at 10:56
  • @AKX updated full error. – Naxi Aug 12 '21 at 10:59
  • Okay, thanks. As a first debugging step (to confirm my hunch about truncation) can you download the file with your browser (or curl or whatnot) via the API Gateway URL and separately from the S3 console? Just to make sure it's not the API Gateway messing up your file. – AKX Aug 12 '21 at 11:01
  • That's correct. API Gateway is somehow messing up the Feather file. I downloaded the file using curl via the API Gateway URL and also directly from the S3 console. The one from the S3 console reads just fine from the Python code, while the one from the API Gateway URL throws the same error as above. – Naxi Aug 12 '21 at 12:12
  • What about the file sizes? Is the API Gatewayed file e.g. smaller than the correct file, or is it the right size but internally corrupt otherwise? If the latter, can you e.g. hex diff the two? – AKX Aug 12 '21 at 12:16
  • Also, based on [this documentation](https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-s3.html#api-items-in-folder-as-s3-objects-in-bucket) it could be that you need to explicitly specify all the binary content-types you don't want API Gateway to do anything for, and additionally have the client send a suitable `Accept` header (which you'd need `requests` for, not just `pd.read_feather()`)... – AKX Aug 12 '21 at 12:19
  • Will try this out. Also, just now realized that there is a size limit of 10 MB on the payload. My files are mostly greater than that. So doesn't look like this approach will work for me. Any other way you think this is possible without using aws credentials ? – Naxi Aug 12 '21 at 13:54
  • Instead of using the API Gateway service, you could just write your own web service that acts as a REST-ish proxy to your S3 and run that on top of EC2 or ECS. – larsks Aug 13 '21 at 11:51
  • Are your Feather files stored compressed or uncompressed in S3? – Life is complex Aug 15 '21 at 19:25
  • They are uncompressed. – Naxi Aug 16 '21 at 07:13

2 Answers


The Intake server can be used as a data gateway, if you wish, and Intake's plugins communicate with S3 natively via fsspec/s3fs. Intake deals in datasets, not files, so you would want to find the correct invocation for each dataset you want to read (i.e., the set of arguments pandas would normally take) and write descriptions and metadata before launching the server.

There is no Feather driver, however (unlike Parquet), although one would be easy to write. The intake-dremio package, for instance, already interfaces with Arrow transport directly.
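As a sketch of what such a catalog might look like (the dataset name, bucket, and key here are placeholders, and this uses the existing Parquet driver rather than Feather), an Intake catalog served by `intake-server` could be:

```yaml
# catalog.yml -- illustrative; names and paths are placeholders
sources:
  my_dataset:
    description: Example dataset exposed without client-side AWS credentials
    driver: parquet
    args:
      urlpath: s3://naxi143/data.parquet
      storage_options:
        anon: false   # the *server* holds the AWS credentials, not the client
```

The server would be started with something like `intake-server catalog.yml`, and clients read through Intake's client protocol, never touching S3 directly.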

mdurant

I think the solution you're looking for is API Gateway + pre-signed S3 URLs + 303 HTTP redirects + CORS. That securely gets around the API Gateway payload limit, because the client is redirected to fetch the object directly from S3 via a signed URL. Here is a really good explanation of how to set that up:

https://advancedweb.hu/how-to-solve-cors-problems-when-redirecting-to-s3-signed-urls/

It essentially comes down to setting some headers in the REST call to configure CORS so that a client is allowed to receive a 303 redirect to a different domain (CORS is a defence against cross-site-scripting-style attacks). But because there are security implications, I suggest reading the whole article and understanding what you're allowing, rather than just copying the header names and values.
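A rough sketch of the Lambda side of that pattern (the handler shape, bucket name, and `make_presigned_url` helper are assumptions for illustration; in a real deployment the URL would come from boto3's `generate_presigned_url`):

```python
# Illustrative Lambda handler: answer with a 303 redirect to a pre-signed
# S3 URL, so the object itself never passes through API Gateway's 10 MB limit.

def make_presigned_url(key, expires=300):
    # Placeholder: in a real Lambda this would be something like
    #   boto3.client("s3").generate_presigned_url(
    #       "get_object",
    #       Params={"Bucket": "my-bucket", "Key": key},
    #       ExpiresIn=expires)
    raise NotImplementedError

def handler(event, context, presign=make_presigned_url):
    # 'presign' is injectable here only to keep the sketch testable offline.
    key = event["queryStringParameters"]["key"]
    return {
        "statusCode": 303,  # "See Other": client re-requests Location with GET
        "headers": {
            "Location": presign(key),
            # CORS: lets a browser client follow the cross-origin redirect;
            # lock this down per the linked article rather than using "*".
            "Access-Control-Allow-Origin": "*",
        },
        "body": "",
    }
```

The client then downloads straight from S3 with the signed URL, so the size limit no longer applies and no AWS credentials ever reach the user's machine.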

CognizantApe