
I have made a web scraper that downloads a bunch of PDFs. The script is basically a loop that downloads one PDF (~8 MB) per iteration. The total size is estimated to be >300 GB. I was thinking that instead of creating an instance with that much storage, why not upload the PDFs to an S3 bucket as soon as they are downloaded?

I will be using a t2.xlarge Ubuntu instance. The loop is supposed to run for two weeks, so I believe it will be cheaper to use an S3 bucket than to attach that much extra storage to the instance.

The thing is that the script downloads the PDFs into the /Downloads folder. I think I need to mount a bucket using s3fs? Then I would recursively copy the files from the Downloads folder into the mounted bucket, and then use rm to delete everything in the /Downloads folder. Is this the way to go, or is there a more straightforward way?
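For reference, this is roughly what I have in mind (untested; the bucket name, mount point, and paths are just placeholders, and the s3fs credentials would live in ~/.passwd-s3fs):

# mount the bucket once with s3fs
sudo apt-get install -y s3fs
mkdir -p ~/s3-bucket
s3fs my-pdf-bucket ~/s3-bucket -o passwd_file=${HOME}/.passwd-s3fs

# after each batch of downloads: copy into the mounted bucket, then clean up locally
cp -r ~/Downloads/. ~/s3-bucket/
rm -rf ~/Downloads/*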

Any help or documentation link would be appreciated! Thanks!


Vibhu

1 Answer


You can do this much more simply with AWS Lambda.

Create a scheduled trigger for an AWS Lambda function. In the function, pull the PDF file and save it directly to S3.

CloudWatch Events (cron) --> Lambda --> S3

With this, you pay only for the time your code actually runs; there is no fixed fee.
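If you want to set the trigger up from the CLI, the wiring looks roughly like this (a sketch only; the rule name, function name, schedule, region, and account ID are placeholders, and it assumes the Lambda function already exists):

# create the scheduled rule
aws events put-rule --name pdf-fetch-schedule --schedule-expression "rate(5 minutes)"

# allow CloudWatch Events to invoke the function
aws lambda add-permission --function-name pdf-fetcher \
    --statement-id pdf-fetch-schedule --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:us-east-1:123456789012:rule/pdf-fetch-schedule

# point the rule at the function
aws events put-targets --rule pdf-fetch-schedule \
    --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:pdf-fetcher"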

If you are more comfortable with the command line, you can have your script download each file and stream it straight to S3:

curl "https://linktopdf/" | aws s3 cp - s3://bucket/filename

You can just use a t2.small for this purpose.

Hope it helps.

Kannaiyan
  • The script I have made uses Selenium and Chrome. Will it work in a Lambda function? – Vibhu Aug 10 '19 at 04:46
  • You can run Selenium and headless Chrome in Lambda. More info can be found at https://medium.com/clog/running-selenium-and-headless-chrome-on-aws-lambda-fb350458e4df – Kannaiyan Aug 10 '19 at 07:29
  • Thanks but I can't find a tutorial to do that – Vibhu Aug 10 '19 at 13:03