2

I see that there are multiple methods to do this, but I was not able to get it working with AWS Lambda (I may be missing something there). Any recommendations on the method, and preferably a link to the implementation steps, would be useful. The public Git repository is huge, but I only need the CSV files from one subdirectory.

Pratibha UR
  • 116
  • 7
  • 1
    Could you please clarify what you are wanting to do? Are you wanting to clone the _entire_ repo into S3, or just selected files? How big is the repo? (Lambda has a limit of 512MB disk storage unless you do fancy stuff.) – John Rotenstein Jul 19 '20 at 01:53
  • Which Git repo are you using? GitHub? CodeCommit? – Harish KM Jul 19 '20 at 10:53

2 Answers

1

Most Git hosting services (such as GitHub) provide a raw link for individual files, for example https://github.com/thephpleague/csv/raw/master/tests/data/foo.csv. You can use your favorite HTTP client in your favorite runtime to pull such a file down from within Lambda.
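A minimal sketch of that approach in Python, assuming a public raw URL and a bucket/key of your own (the URL, bucket and key below are placeholders; urllib and boto3 are available in the Lambda Python runtime, and the execution role needs s3:PutObject):

import urllib.request
import boto3

s3 = boto3.client("s3")

# Hypothetical values: replace with your repo's raw URL and your bucket/key.
RAW_URL = "https://github.com/thephpleague/csv/raw/master/tests/data/foo.csv"
BUCKET = "my-target-bucket"
KEY = "data/foo.csv"

def lambda_handler(event, context):
    # Download the raw file over HTTP into memory (fine for small CSVs).
    with urllib.request.urlopen(RAW_URL) as resp:
        body = resp.read()

    # Upload the bytes straight to S3.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)
    return {"uploaded": f"s3://{BUCKET}/{KEY}", "size": len(body)}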

If you feel the file is too large to fit in Lambda's 512 MB of /tmp storage, you can mount an EFS file system into the function (https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/).
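Sketched below: streaming the download to an EFS mount path instead of holding it in memory or /tmp. The mount path /mnt/data is an assumption and must match the file-system configuration you attach to the function.

import shutil
import urllib.request
import boto3

s3 = boto3.client("s3")

RAW_URL = "https://github.com/thephpleague/csv/raw/master/tests/data/foo.csv"  # hypothetical
EFS_PATH = "/mnt/data/foo.csv"  # assumed EFS mount path configured on the function

def lambda_handler(event, context):
    # Stream the HTTP response to EFS in chunks so it never has to fit in memory.
    with urllib.request.urlopen(RAW_URL) as resp, open(EFS_PATH, "wb") as out:
        shutil.copyfileobj(resp, out)

    # Upload from EFS to S3 using a managed (multipart-capable) transfer.
    s3.upload_file(EFS_PATH, "my-target-bucket", "data/foo.csv")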

And if the file is so large that you cannot download it within the 15-minute Lambda timeout, you can download it in parts across multiple Lambda invocations and save the resume status on EFS. You can also store the resume info in the function's /tmp folder; it will still be there on the next invocation if the same execution environment is reused soon enough, but that is not guaranteed, so EFS is the safer option.
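A rough sketch of that resume pattern, assuming the server honours HTTP Range requests and that the partial file lives on the assumed EFS mount /mnt/data (one invocation fetches one chunk; re-trigger until "done" is true):

import os
import urllib.request

RAW_URL = "https://github.com/thephpleague/csv/raw/master/tests/data/foo.csv"  # hypothetical
PART_PATH = "/mnt/data/foo.csv.part"   # assumed EFS mount path
CHUNK = 200 * 1024 * 1024              # bytes to fetch per invocation

def lambda_handler(event, context):
    # Resume from wherever the previous invocation stopped.
    offset = os.path.getsize(PART_PATH) if os.path.exists(PART_PATH) else 0

    # Ask only for the next chunk of bytes.
    req = urllib.request.Request(
        RAW_URL, headers={"Range": f"bytes={offset}-{offset + CHUNK - 1}"}
    )
    with urllib.request.urlopen(req) as resp, open(PART_PATH, "ab") as out:
        data = resp.read()
        out.write(data)

    # Fewer bytes than requested means we reached the end of the file.
    return {"offset": offset + len(data), "done": len(data) < CHUNK}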

Hope that answers your question.

Vikas
  • 626
  • 1
  • 10
  • 22
0

You can use a sparse checkout combined with a shallow clone for this. The example below pulls only the csvsubfolder directory:

git init <repo>
cd <repo>
git remote add origin <url>
# Only check out the paths listed in .git/info/sparse-checkout
git config core.sparseCheckout true
echo "csvsubfolder/*" >> .git/info/sparse-checkout
# Shallow pull: fetch only the latest commit on master
git pull --depth=1 origin master
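To then get the checked-out CSV files into S3 (the question's end goal), a small follow-up sketch using boto3; the local path and bucket name are assumptions to replace with your own:

import os
import boto3

s3 = boto3.client("s3")

REPO_DIR = "repo/csvsubfolder"   # path produced by the sparse checkout above (assumed)
BUCKET = "my-target-bucket"      # hypothetical bucket name

# Upload every CSV file from the checked-out subfolder, keeping relative paths as keys.
for root, _dirs, files in os.walk(REPO_DIR):
    for name in files:
        if name.endswith(".csv"):
            path = os.path.join(root, name)
            key = os.path.relpath(path, REPO_DIR)
            s3.upload_file(path, BUCKET, key)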