
I have a huge txt file and I need to load it into DynamoDB. The file structure is:

223344|blue and orange|Red|16/12/2022

223344|blue and orange|Red|16/12/2022 ...

The file has more than 200M lines.

I have tried to convert it to a JSON file using the code below:

import json

with open('mini_data.txt', 'r') as f_in:
    for line in f_in:
        line = line.strip().split('|')
        filename = 'smini_final_data.json'
        # the output file must already exist and contain an empty JSON list ([]) before the first run
        result = {"fild1": line[0], "fild2": line[1], "fild3": line[2].replace(" ", ""), "fild4": line[3]}
        with open(filename, "r") as file:
            data = json.load(file)
        data.append(result)
        with open(filename, "w") as file:
            json.dump(data, file)

But this isn't efficient, and it's only the first part of the job (converting the data to JSON); after this I still need to put the JSON into DynamoDB.

I have used this code for the insert (it looks good):

    def insert(self):
        if not self.dynamodb:
            self.dynamodb = boto3.resource(
                'dynamodb', endpoint_url="http://localhost:8000")
        table = self.dynamodb.Table('fruits')

        json_file = open("final_data.json")
        orange = json.load(json_file, parse_float = decimal.Decimal)

        with table.batch_writer() as batch:
            for fruit in orange:
                fild1 = fruit['fild1']
                fild2 = fruit['fild2']
                fild3 = fruit['fild3']
                fild4 = fruit['fild4']

                batch.put_item(
                    Item={
                        'fild1':fild1,
                        'fild2':fild2,
                        'fild3':fild3,
                        'fild4':fild4
                    }
                )

So, does anyone have suggestions for processing this txt file more efficiently?

Thanks

1 Answer

The step of converting from delimited text to JSON seems unnecessary in this case. The way you've written it requires reopening and rewriting the JSON file for each line of your delimited text file. That I/O overhead repeated 200M times can really slow things down.

I suggest going straight from your delimited text to DynamoDB. It might look something like this:

import boto3

dynamodb = boto3.resource(
    'dynamodb', endpoint_url="http://localhost:8000")
table = dynamodb.Table('fruits')

# batch_writer buffers the puts and sends them to DynamoDB in batches of 25,
# retrying any unprocessed items automatically
with table.batch_writer() as batch:
    with open('mini_data.txt', 'r') as f_in:
        for line in f_in:
            line = line.strip().split('|')
            batch.put_item(
                Item={
                    'fild1':line[0],
                    'fild2':line[1],
                    'fild3':str(line[2]).replace(" ",""),
                    'fild4':line[3]
                }
            )
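
If you do still need an intermediate JSON file for some reason, one way to avoid re-reading and re-writing the whole file on every line is to keep a single output handle open and write one JSON object per line (JSON Lines). This is only a minimal sketch, assuming the same field names as above and a hypothetical output file name final_data.jsonl:

import json

# Sketch: stream the delimited file once and append one JSON object per line
# (JSON Lines), instead of reloading and rewriting the entire file for each record.
# 'final_data.jsonl' is a hypothetical output name; adjust as needed.
with open('mini_data.txt', 'r') as f_in, open('final_data.jsonl', 'w') as f_out:
    for line in f_in:
        fields = line.strip().split('|')
        record = {
            'fild1': fields[0],
            'fild2': fields[1],
            'fild3': fields[2].replace(' ', ''),
            'fild4': fields[3],
        }
        f_out.write(json.dumps(record) + '\n')

A JSON Lines file can then be read back one line at a time (json.loads per line) rather than with a single json.load of the whole file, which also keeps memory use flat for 200M records.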
Tyler Ham
  • The timing of this solution is similar to the code in the question: around 30 minutes to insert 500,000 items. Does anyone have another suggestion? – Roger Monteiro Dec 28 '22 at 20:14