0

I have a small csv file that looks like that :

is_employee,candidate_id,gender,hesa_type,university
FALSE,b9bb80,Male,Mathematical sciences,Birmingham
FALSE,8e552d,Female,Computer science,Swansea
TRUE,2bc475,Male,Engineering & technology,Aston
TRUE,c3ac8d,Female,Mathematical sciences,Heriot-Watt
FALSE,ceb2fa,Female,Mathematical sciences,Imperial College London

The following lambda function is used to query from an s3bucket.

import boto3
import os
import json

def lambda_handler(event, context):
    BUCKET_NAME = 'foo'
    KEY = 'bar/data.csv'  
    s3 = boto3.client('s3','eu-west-1')
    response = s3.select_object_content(
        Bucket = BUCKET_NAME,
        Key = KEY,
        ExpressionType = 'SQL',
        Expression = 'Select count(*) from s3object s where s.gender like \'%Female%\'',
        InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization = {'JSON': {}},
    )

    for i in response['Payload']:
        if 'Records' in i:
            query_result = i['Records']['Payload'].decode('utf-8')

    print(list(json.loads(query_result).values())[0])

Now, this works great as I get back a result of 3. But for some reason the same code does not work when changing the like operator to =, results drop down to 0, so no match found. What's happening here ?

Louis GRIMALDI
  • 101
  • 1
  • 12
  • 1
    Show your actual code with the `=` in it. Are you removing the `%` characters? – Mark B Feb 06 '20 at 15:54
  • I agree, syntax looks correct for your expression (assuming you're just replacing the like with `=`). You need to show the full statement using `=` – Alex Feb 06 '20 at 16:41
  • ```Expression = 'Select count(*) from s3object s where s.gender = \'%Female%\''``` – Louis GRIMALDI Feb 06 '20 at 16:57
  • This comes back as 0. – Louis GRIMALDI Feb 06 '20 at 16:57
  • Ok so @MarkB is right - I forgot to remove the ```%``` placeholders, so now ```=``` works for all columns except the last one: ```Expression = 'Select count(*) from s3object s where s.university like \'%Swansea%\''```returns ```1``` - ```Expression = 'Select count(*) from s3object s where s.university = \'Swansea\''```returns ```0```. It all works except for the last column, university. – Louis GRIMALDI Feb 06 '20 at 17:02
  • Are there any whitespace characters at the end of the line? – EpicVoyage Feb 07 '20 at 10:56

1 Answers1

1

So I found the problem. The problem was that the items of the last column were followed by a newline character, which was not understood by the AWS S3 interpreter. So really, a university name was not Swansea, but more Swansea\n.

So s.university = \'Swansea\'' does not work; however, s.university LIKE \'Swansea%\'' does work, and is still a sargable expression.

Louis GRIMALDI
  • 101
  • 1
  • 12