
Good afternoon. I am hoping someone can help me with this issue.

I have multiple CSV files sitting in an S3 folder. I would like to use Python, without pandas or the csv package (because AWS Lambda has very limited packages available and there is a size restriction), to loop through the files in the S3 bucket and read each CSV's dimensions (number of rows and number of columns).

For example, my S3 folder contains two CSV files (1.csv and 2.csv). My code should run through the specified S3 folder, count the rows and columns in 1.csv and 2.csv, and put the results in a new CSV file. I greatly appreciate your help! I can do this using the pandas package (thank god for pandas), but AWS Lambda has restrictions that limit what I can use.

AWS Lambda uses Python 3.7.
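For reference, this is roughly how I do it with pandas today (just a sketch; the bucket and key names are placeholders, not my real ones):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
# placeholder bucket/key names for illustration
obj = s3.get_object(Bucket='my-bucket', Key='1.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
rows, columns = df.shape  # (number of rows, number of columns)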

xboxuser
  • You know you can use `pandas` in AWS Lambda; you just have to zip up the packaged dependency with the rest of your scripts. – gold_cy Mar 07 '19 at 03:17
  • I think you forgot to set up permissions for the `Lambda function` in the Lambda dashboard. You need to make sure the S3 buckets are accessible from `Lambda`. It is more a question about `serverless` and `lambda` than `python`. – tim Mar 07 '19 at 03:22
  • Thank you, aws_apprentice. I was also exploring that option; one of my co-workers used that method, and he mentioned that we lose the ability to look into the code, so I didn't explore it further, but I will explore it as well. That will be so much easier! – xboxuser Mar 07 '19 at 03:23
  • Hi Tim, good afternoon. My permissions are set up correctly, as I am able to upload/remove files using boto3 in Lambda. I can explore that area more to make sure nothing else is missing. Thank you – xboxuser Mar 07 '19 at 03:26

1 Answer


If you can access your S3 resources from your Lambda function, then basically do this to check the rows:

import boto3 as bt3

def lambda_handler(event, context):
    s3 = bt3.client('s3')
    csv1_data = s3.get_object(Bucket='the_s3_bucket', Key='1.csv')
    csv2_data = s3.get_object(Bucket='the_s3_bucket', Key='2.csv')

    # decode the raw bytes, then split on line breaks so each element is one CSV row
    contents_1 = csv1_data['Body'].read().decode('utf-8')
    contents_2 = csv2_data['Body'].read().decode('utf-8')
    rows1 = contents_1.splitlines()
    rows2 = contents_2.splitlines()
    return len(rows1), len(rows2)

It should work directly; if not, please let me know. BTW, hard-coding the bucket and file names into the function like I did in the sample is not a good idea at all.
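To cover the rest of the question (column counts and writing the results to a new CSV), here is a minimal sketch that loops over every CSV under a prefix, using only boto3 and the standard library. The bucket name, the 'incoming/' prefix, and the 'dimensions.csv' output key are placeholders you would replace with your own:

import boto3 as bt3

def lambda_handler(event, context):
    s3 = bt3.client('s3')
    bucket = 'the_s3_bucket'  # placeholder bucket name

    results = ['file,rows,columns']
    # list every object under the folder/prefix that holds the CSV files
    listing = s3.list_objects_v2(Bucket=bucket, Prefix='incoming/')
    for obj in listing.get('Contents', []):
        key = obj['Key']
        if not key.endswith('.csv'):
            continue
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        rows = body.splitlines()
        # naive column count: split the header line on commas
        # (this will miscount if fields contain quoted commas)
        columns = len(rows[0].split(',')) if rows else 0
        results.append('{},{},{}'.format(key, len(rows), columns))

    # write the summary back to S3 as a new CSV
    s3.put_object(Bucket=bucket, Key='dimensions.csv',
                  Body='\n'.join(results).encode('utf-8'))
    return results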

Regards.

tim