
I am able to read a single JSON record from an S3 bucket and write it to DynamoDB. However, when I try to read from a file containing multiple JSON objects, I get an error. Please find the code and error below. Request you to please help resolve the same. Lambda code (reads the S3 file and writes to DynamoDB):

import json
import boto3

s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
    # TODO implement
    bucket = event['Records'][0]['s3']['bucket']['name']
    json_file_name = event['Records'][0]['s3']['object']['key']
        
    print(bucket)
    print(json_file_name)
    json_object = s3_client.get_object(Bucket=bucket, Key=json_file_name)
    jsonFileReader = json_object['Body'].read()
    print(jsonFileReader)
    
    jsonFile = json.loads(jsonFileReader)
    
    print(jsonFile)
    print(type(jsonFile))
    
    jsonDict = {"test":item for item in jsonFile}
    print(type(jsonDict))
    print(jsonDict)
    
    table = dynamodb.Table('Twitter-data-stream')
    print(type(table))
    
    table.put_item(Item=jsonDict['test'])
    return 'Hello from Lambda!'

Error in CloudWatch:

[ERROR] JSONDecodeError: Extra data: line 1 column 230 (char 229)
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 20, in lambda_handler
    jsonFilerec = json.loads(jsonFileReader)
  File "/var/lang/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/var/lang/lib/python3.8/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)

Please find sample records from the S3 file below:

b'[{"id": "1305857561179152385", "tweet": "If you like vintage coke machines and guys who look like Fred Flintstone you\'ll love the short we\'ve riffed: Coke R, "ts": "Tue Sep 15 13:14:38 +0000 2020"}][{"id": "1305858267067883521", "tweet": "Chinese unicorn Genki Forest plots own beverage hits  #China #Chinese #Brands #GoingGlobal\\u2026 ", "ts": "Tue Sep 15 13:17:27 +0000 2020"}][{"id": "1305858731293507585", "tweet": "RT @CinemaCheezy: If you like vintage coke machines and guys who look like Fred Flintstone you\'ll love the short we\'ve riffed: Coke Refresh\\u2026", "ts": "Tue Sep 15 13:19:17 +0000 2020"}]'
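The error can be reproduced outside Lambda: json.loads accepts exactly one top-level JSON value, so a string of concatenated arrays like the sample above fails as soon as the first closing bracket is passed. A minimal sketch (shortened data, illustrative values):

```python
import json

# Two concatenated JSON arrays, shortened from the S3 sample above.
raw = '[{"id": "1", "tweet": "a"}][{"id": "2", "tweet": "b"}]'

try:
    json.loads(raw)
    error = None
except json.JSONDecodeError as exc:
    # json.loads parses exactly one top-level value; anything after the
    # first complete array is reported as "Extra data".
    error = exc

print(error)  # Extra data: line 1 column 28 (char 27)
```

This is the same "Extra data" failure shown in the CloudWatch traceback, just at a different character offset.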

Adding the producer / input code which generates the JSON file:

import boto3
import json
from datetime import datetime
import calendar
import random
import time
import sys
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import preprocessor as p


#Variables that contains the user credentials to access Twitter API
consumer_key = '************'
consumer_secret ='******************'
access_token = '********************'
access_token_secret = '***************'

# Create tracklist with the words that will be searched for
tracklist = ['#coke']

awsRegionName='us-east-1'
awsAccessKey='************'
awsSecretKey='**********'

class TweetStreamListener(StreamListener):
    # on success
    def on_data(self, data):
        # decode json
        tweet = json.loads(data)
        print(type(tweet))
        #print(tweet)
        if "text" in tweet.keys():
            payload = {'id': str(tweet['id']),
                       'tweet': str(tweet['text'].encode('utf8', 'replace')),
                       #'tweet': str(tweet['text']),
                       'ts': str(tweet['created_at']),
                       },
                       
            try:
                print(tweet)
                #print(payload)
                
                               
                put_response = kinesis_client.put_record(
                    StreamName=stream_name,
                    Data=json.dumps(payload),
                    PartitionKey=str(['screen_name']))
                    #PartitionKey=str(tweet['user']['screen_name']))
            except (AttributeError, Exception) as e:
                print(e)
                pass
        return True

    # on failure
    def on_error(self, status):
        print("On_error status:", status)


stream_name = 'twitter-data-stream'  # fill the name of Kinesis data stream you created
#stream_name = 'demo-datastream' 

if __name__ == '__main__':
    # create kinesis client connection
    kinesis_client = boto3.client('kinesis',
                                  region_name=awsRegionName,
                                  aws_access_key_id=awsAccessKey,
                                  aws_secret_access_key=awsSecretKey)
    
    # create instance of the tweepy tweet stream listener
    listener = TweetStreamListener()
    # set twitter keys/tokens
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # create instance of the tweepy stream
    stream = Stream(auth, listener)
    # search twitter for tags or keywords from cli parameters
    #query = sys.argv[1:]  # list of CLI arguments
    #query_fname = ' '.join(query)  # string
    stream.filter(track=tracklist)
    #tweets = api.search(tracklist, count=10, lang='en', exclude='retweets',tweet_mode = 'extended')
                         
    
    

                         
    

Regards, Priti

1 Answer


Maybe your JSON is incorrect: [tweet_data][...][...][...] is not a valid JSON document. You should work on your input data so that it looks like this: [{tweet_data},{...},{...},{...}]
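If the producer cannot be changed, the concatenated-array format can still be parsed on the Lambda side with json.JSONDecoder.raw_decode, which parses one top-level value at a time and returns the offset where it stopped. A sketch under that assumption (the DynamoDB write is omitted; the helper name is illustrative):

```python
import json

def iter_json_values(blob):
    """Yield each top-level JSON value from a string like '[...][...][...]'."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(blob):
        # raw_decode returns (value, end_offset) for one complete value.
        value, pos = decoder.raw_decode(blob, pos)
        yield value
        # Skip any whitespace between concatenated values.
        while pos < len(blob) and blob[pos].isspace():
            pos += 1

sample = '[{"id": "1"}][{"id": "2"}][{"id": "3"}]'
# Each parsed value is itself a one-element list, so flatten once.
items = [item for chunk in iter_json_values(sample) for item in chunk]
print(items)
```

In the Lambda, each resulting item could then be written individually, e.g. with table.put_item(Item=item) or a batch writer, instead of the single put_item on a merged dict.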

Julien
  • It works fine for a single item in the same format. – Priti Palekar Sep 16 '20 at 05:16
  • It works fine for a single item in the same format. Also, if you check the S3 sample record set, it is in the expected JSON format – Priti Palekar Sep 16 '20 at 05:33
  • No, it's not. A single array [] is valid, hence it works. But multiple concatenated arrays are not valid JSON content ([...][...][...]). See https://www.json.org/json-en.html and https://jsonlint.com/ – Julien Sep 16 '20 at 07:13
  • Thanks Julien, you are right. I have updated my producer / input code which generates this JSON file. Can you please have a look and suggest what change I need to make to create a single JSON array? – Priti Palekar Sep 16 '20 at 07:58
  • Sorry, not familiar with that type of code. You should close this post, since it is answered (the errors come from invalid JSON), and ask another question regarding the JSON output of your producer. – Julien Sep 16 '20 at 08:19
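For what it's worth, the [{...}] wrapping in each record appears to come from the trailing comma after the payload dict in on_data: the comma turns payload into a one-element tuple, and json.dumps serializes a tuple as a JSON array. A minimal sketch of the difference (variable names and values are illustrative):

```python
import json

tweet = {"id": "1", "tweet": "hello", "ts": "Tue Sep 15 13:14:38 +0000 2020"}

# Trailing comma: payload_tuple is a one-element tuple, not a dict.
payload_tuple = {"id": tweet["id"], "tweet": tweet["tweet"], "ts": tweet["ts"]},
# Without the comma: a plain dict.
payload_dict = {"id": tweet["id"], "tweet": tweet["tweet"], "ts": tweet["ts"]}

print(json.dumps(payload_tuple))  # wrapped in an array: [{...}]
print(json.dumps(payload_dict))   # plain object: {...}
```

Removing that comma would make each Kinesis record a plain JSON object; the records landing in S3 would then still need to be joined into a single array (or parsed one at a time) before json.loads can read the whole file in one call.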