
I have a large JSON data file (3.7 GB). I am going to load the JSON file into a dataframe, delete the unused columns, then convert it to CSV and load it into SQL. My machine has 40 GB of RAM. My JSON file structure:

{"a":"Ho Chi Minh City, Vietnam","gender":"female","t":"841675194476","id":"100012998502085","n":"Lee Mến"}
{"t":"84945474479","id":"100012998505399","n":"Hoàng Giagia"}
{"t":"841679770421","id":"100012998505466","n":"Thoại Mỹ"}

I tried to load the data, but it fails because of insufficient memory:

import ijson

data_phone = []
with open('data.json', 'r', encoding="UTF-8") as f:
    # Stream the "t" values, but appending them all to a list
    # still keeps everything in memory.
    numbers = ijson.items(f, 't', multiple_values=True)
    for num in numbers:
        data_phone.append(num)

It shows the error:

Out of memory

I tried another way:

import json

fb_data = {}
i = 1

with open('output.csv', 'w') as csv_file:
    with open("Vietnam_Facebook_Scrape.json", encoding="UTF-8") as json_file:
        for line in json_file:
            data = json.loads(line)
            try:
                csv_file.write('; '.join([str(i), "/", data["t"], data["fbid"]]) + '\n')
                i += 1
            except KeyError:
                pass

Then I converted the CSV to SQL, and it still shows the error "MemoryError":

import sqlite3 as db

con = db.connect("fbproject.db")
cur = con.cursor()
with open('output.csv', 'r', encoding="UTF-8") as csv_file:
    for item in csv_file:
        cur.execute('insert into fbdata values (?)', (item,))
con.commit()
con.close()

Thanks for reading

  • ijson exists for precisely this purpose, but it would be easier to help you if you showed us the format of your json data – joao Feb 06 '21 at 17:04
  • Why do you need to create the list and the data frame at all? – juanpa.arrivillaga Feb 06 '21 at 17:06
  • @joao I just updated my data structure. I can handle the same task with a 500 MB JSON file, but the large file is 3.7 GB and it shows the error – Dinh Truong Anh Phuong Feb 06 '21 at 17:31
  • @juanpa.arrivillaga my final purpose is to load the data into Django. Step 1: read the JSON file, Step 2: load it into a dataframe, Step 3: save it as a CSV, Step 4: load the CSV into SQL, Step 5: load the data into Django to search – Dinh Truong Anh Phuong Feb 06 '21 at 17:33
  • @DinhTruongAnhPhuong that isn't answering my question. Why does it need to be a dataframe or a list at all? You are ultimately trying to write a CSV file, so just create the CSV file as you iterate over the JSON. – juanpa.arrivillaga Feb 06 '21 at 17:44

1 Answer


Your proposal is:

  • Step 1 read json file
  • Step 2 load to dataframe
  • Step 3 save file as a csv
  • Step 4 load csv to sql
  • Step 5 load data to django to search

The problem with your first example is that you load every value into a global list (data_phone), which grows over time until memory runs out.

Here's what you should try, for huge files:

  • Step 1 read json
    • line by line
    • do not save any data into a global list
    • write data directly into SQL (see the sketch after this list)
  • Step 2 Add indexes to your database
  • Step 3 use SQL from django
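
Here's a minimal sketch of that approach, assuming a two-column fbdata table holding the id and t fields (the table and index definitions are illustrative; adjust them to the columns you actually need):

import json
import sqlite3

con = sqlite3.connect("fbproject.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS fbdata (id TEXT PRIMARY KEY, t TEXT)")

with open("Vietnam_Facebook_Scrape.json", encoding="UTF-8") as json_file:
    for line in json_file:
        data = json.loads(line)
        # Write each record straight to SQL; nothing accumulates in memory.
        cur.execute("INSERT OR IGNORE INTO fbdata VALUES (?, ?)",
                    (data.get("id"), data.get("t")))

# Step 2: index the column you will search by, e.g. the phone number.
cur.execute("CREATE INDEX IF NOT EXISTS idx_fbdata_t ON fbdata (t)")
con.commit()
con.close()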

You don't need to write anything to CSV. If you really want to, you could simply write the file line by line:

import json

with open('output.csv', 'w') as csv_file:
    with open("Vietnam_Facebook_Scrape.json", encoding="UTF-8") as json_file:
        for line in json_file:
            data = json.loads(line)
            # One record per line; nothing is kept in memory.
            csv_file.write(';'.join([data['id'], data['t']]) + '\n')
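
If a field could itself contain the delimiter, the standard csv module takes care of quoting for you; a sketch of the same loop:

import csv
import json

with open('output.csv', 'w', newline='', encoding="UTF-8") as csv_file:
    writer = csv.writer(csv_file, delimiter=';')
    with open("Vietnam_Facebook_Scrape.json", encoding="UTF-8") as json_file:
        for line in json_file:
            data = json.loads(line)
            # writerow handles the delimiter, quoting and line endings.
            writer.writerow([data['id'], data['t']])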

Here's a question which might help you (Python and SQLite: insert into table), in order to write to a database row by row.

If you want to use your CSV instead, be sure that the program you use to convert CSV to SQL doesn't read the whole file but parses it line by line or in batches.
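
For instance, with sqlite3 you can read the CSV in fixed-size batches and insert each batch with executemany, so memory use stays bounded (the 10,000-row batch size is an arbitrary choice, and the two-column fbdata table matches the sketch above):

import csv
import sqlite3
from itertools import islice

con = sqlite3.connect("fbproject.db")
cur = con.cursor()

with open('output.csv', encoding="UTF-8") as csv_file:
    reader = csv.reader(csv_file, delimiter=';')
    while True:
        # islice reads at most 10,000 rows at a time, never the whole file.
        batch = list(islice(reader, 10000))
        if not batch:
            break
        cur.executemany("INSERT OR IGNORE INTO fbdata VALUES (?, ?)", batch)

con.commit()
con.close()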

Eric Duminil