
How do I integrate my scraping code with lambda_handler to save the data in an S3 bucket? I am not able to save the data. I have a regular AWS account, not an enterprise one (the account AWS gives for $2.00). I need to save the data in the S3 bucket named 'my_content'. I am able to generate the data.json file. How can I save this data.json directly to the my_content bucket using a Lambda handler in AWS?

My scraping code is below:

from bs4 import BeautifulSoup
import ssl
import json
from urllib.request import Request, urlopen

# Context for ignoring SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def get_soup(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req, context=ctx).read()  # pass the context so it is actually used
    soup = BeautifulSoup(webpage, 'html.parser')
    return soup

url = 'https://www.youtube.com/feed/trending'
soup = get_soup(url)
video_details = {}

# All the trending YouTube links
youtubelinks = []
for a in soup.select('a[href^="/watch?v="]')[:3]:
    youtubelinks.append("https://www.youtube.com" + a['href'])

# De-duplicate once, after the loop, preserving order
youtubelinks = list(dict.fromkeys(youtubelinks))

for link in youtubelinks:
    page = get_soup(link)  # avoid shadowing the loop variable
    for span in page.find_all('span', attrs={'class': 'watch-title'}):
        video_details['TITLE'] = span.text.strip()
    print(video_details)

# Write once, after the loop, instead of rewriting the file on every iteration
with open('data.json', 'w', encoding='utf8') as outfile:
    json.dump(video_details, outfile, ensure_ascii=False, indent=4)

I have also written the AWS code to put the file in the S3 bucket. How do I integrate the two?

import boto3

def lambda_handler(event, context):
    bucket_name = "my_content"
    file_name = "data.json"
    lambda_path = "/tmp/" + file_name           # /tmp is the only writable path in Lambda
    s3_path = "100001/20191010/" + file_name    # no leading slash in an S3 key

    # The client picks up credentials from the Lambda execution role, so
    # hard-coded access keys are not needed. put_object belongs to the
    # client API; Bucket(...) is part of the resource API, so the two
    # must not be mixed.
    s3 = boto3.client('s3')
    with open(lambda_path, 'rb') as data:
        s3.put_object(Bucket=bucket_name, Key=s3_path, Body=data)
  • **Side-note:** Please note that there is no such thing as an "aws free account". Rather, the [Free Usage Tier](https://aws.amazon.com/free/) is a billing discount that provides a certain amount of services at no charge during the first 12 months of an AWS account. If the services are used beyond the stated amounts, they will be charged as normal. The fact that you are receiving a billing discount has no impact on the behaviour of the services used. – John Rotenstein Oct 11 '19 at 23:03
  • Free account in the sense that it's not an enterprise account; it's the account given for $2.00. –  Oct 12 '19 at 00:31

2 Answers


Here's how you can save data (a JSON file) to S3:

  1. Make sure the AWS IAM role attached to the Lambda function has write permission to the S3 bucket you're uploading to (a minimal policy sketch follows this list).
  2. Scrape the data, write it to a file, and store that file in the /tmp folder.
  3. Upload the file from the /tmp directory using the Boto 3 S3 client's put_object function (see the handler sketch after the list).
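For step 1, here is a minimal sketch of granting that write permission by attaching an inline policy to the function's execution role with Boto 3. The role name my-lambda-role and the policy name are assumptions; in practice this is usually done through the IAM console instead:

import json
import boto3

iam = boto3.client('iam')

# Allow PutObject on the question's bucket only
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::my_content/*",
    }],
}

iam.put_role_policy(
    RoleName='my-lambda-role',             # assumption: your function's execution role
    PolicyName='allow-s3-put-my-content',  # assumption: any policy name works
    PolicyDocument=json.dumps(policy),
)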
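Putting steps 2 and 3 together, here is a minimal sketch of the handler, assuming the question's get_soup helper, bucket name (my_content), and key prefix (100001/20191010/). The title extraction is a placeholder for the question's parsing logic, and BeautifulSoup must be bundled with the deployment package or a Lambda layer, since it is not in the Lambda runtime:

import json
import ssl
import boto3
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def get_soup(url):
    # Same helper as in the question; certificate checks stay disabled for parity
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return BeautifulSoup(urlopen(req, context=ctx).read(), 'html.parser')

def lambda_handler(event, context):
    # 1. Scrape. The <title> tag is used here as a stand-in for the
    #    question's watch-title parsing.
    soup = get_soup('https://www.youtube.com/feed/trending')
    video_details = {'TITLE': soup.title.text.strip() if soup.title else ''}

    # 2. Write the result under /tmp, the only writable path in Lambda
    lambda_path = '/tmp/data.json'
    with open(lambda_path, 'w', encoding='utf8') as outfile:
        json.dump(video_details, outfile, ensure_ascii=False, indent=4)

    # 3. Upload with the S3 client. Credentials come from the execution
    #    role, so no access keys are passed explicitly.
    s3 = boto3.client('s3')
    with open(lambda_path, 'rb') as body:
        s3.put_object(Bucket='my_content', Key='100001/20191010/data.json', Body=body)

    return {'statusCode': 200, 'body': 'data.json uploaded'}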
– Hassan Murtaza (edited by John Rotenstein)

I have written a full sample:

create a bucket:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')

# create a bucket
myBucket = 'stackoverflow2'

try:
    s3.create_bucket(Bucket=myBucket,
                     CreateBucketConfiguration={"LocationConstraint": "eu-central-1"})
except ClientError as e:
    # Ignore only "bucket already exists" errors instead of swallowing everything
    if e.response['Error']['Code'] not in ('BucketAlreadyOwnedByYou', 'BucketAlreadyExists'):
        raise

list all buckets:

# Retrieve the list of existing buckets
s3 = boto3.client('s3')

# list all buckets
response = s3.list_buckets()

# Output the bucket names
print()
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

upload a file:

# Upload a file, printing its contents first so the payload is visible
print()
filename = 'stackoverflow.json'
with open(filename, 'r') as f:
    print(f.read())

# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, myBucket, filename)

list content of bucket:

# list the contents of the bucket
for key in s3.list_objects(Bucket=myBucket)['Contents']:
    print(key['Key'])

output:

Existing buckets:
  stackoverflow2
  terra-form-serverless

{"test": {
  "id": "1",
  "value": "2",
  "attribute": {
    "sub": [
      {"value": "1", "2": "3"},
      {"value": "4", "5": "6"},
      {"value": "7", "8": "9"}
    ]
  }
}}

stackoverflow.json


– Stefan Schulz