0

Goal: Use S3 Select to extract columns from a .parquet on S3.

I've tried various queries. Including the key in the query makes no difference.

Code:

    s3 = boto3.client('s3')
    s3_uri = 's3://my-bucket/my-folder/'
    bucket, prefix = s3_uri[5:].rstrip('/').split('/', 1)

    key = f"{prefix}/{idx.split('-')[0]}.parquet"
    
    cols = ['col1', 'col2']
    query = f"SELECT {', '.join(cols) if isinstance(cols, list) else str(idx)} FROM s3object s WHERE s.key = '{key}'"
    # 1 cols could be either one string, one-to-may members in list

    print(bucket, prefix)
    print(key)
    print(query)

    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={'Parquet': {}},
        OutputSerialization={'JSON': {}}
    )

    data = ''.join(
        event['Records']['Payload'].decode()
        for event in response['Payload']
        if 'Records' in event
    )
    df = pl.read_json(io.StringIO(data), lines=True)

Traceback:

(venv) me@laptop:~/BitBucket/project$ python extract_parquet_column.py 
my-bucket my-folder
my-folder/01A450012.parquet
SELECT 01A450012-5-1767-474-256-256 FROM s3object s WHERE s.key = 'my-folder/01A450012.parquet'

Traceback (most recent call last):
  File "/home/me/BitBucket/project/extract_parquet_column.py", line 48, in getitem
    response = s3.select_object_content(
  File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ParseSelectMissingFrom) when calling the SelectObjectContent operation: Missing FROM after SELECT list at line 1, column 17.
DanielBell99
  • 896
  • 5
  • 25
  • 57

2 Answers2

0

The SELECT-clause needs a prefix s to indicate this field originates in s3object. Because the name contains dashes, it will also need double quotes. Try the following query instead:

SELECT s."01A450012-5-1767-474-256-256" FROM s3object s WHERE s.key = 'my-folder/01A450012.parquet'
Bert Blommers
  • 1,788
  • 2
  • 13
  • 19
  • Adding `s.`: `query = f"SELECT s.{', '.join(idx) if isinstance(idx, list) else str(idx)} FROM s3object s WHERE s.key = '{key}'"` -> `botocore.exceptions.ClientError: An error occurred (ParseSelectMissingFrom) when calling the SelectObjectContent operation: Missing FROM after SELECT list at line 1, column 9.` – DanielBell99 Apr 20 '23 at 09:48
  • Also adding single quotes: `query = f"SELECT s.'{', '.join(idx) if isinstance(idx, list) else str(idx)}' FROM s3object s WHERE s.key = '{key}'"` -> `botocore.exceptions.ClientError: An error occurred (ParseInvalidPathComponent) when calling the SelectObjectContent operation: Invalid Path component, expecting either an IDENTIFIER or STAR, got: LITERAL,at line 1, column 10.` – DanielBell99 Apr 20 '23 at 09:51
0

All other variables were correct.

However, using either single-quotes or double quotes changes the behaviours of what the SQL query returns :/

query = f"""SELECT "{', '.join(idx) if isinstance(idx, list) else idx}" FROM S3Object s"""
>>> ... {"01A450012-5-1767-474-256-256":215}
query = f'''SELECT '{", ".join(idx) if isinstance(idx, list) else idx}' FROM S3Object s'''
>>> ... {"_1":"01A450012-5-1767-474-256-256"}

Not sure if this is a feature or a bug.

DanielBell99
  • 896
  • 5
  • 25
  • 57