
I have a group of .jsonl.gz files. I can read them using this script:

    import json
    import gzip

    with gzip.open(filepath, "r") as read_file:  # file path ends with .jsonl.gz
        try:
            # the gzip file contains JSON lines: each line is a JSON object
            # (a dictionary of nested dictionaries)
            json_list = list(read_file)
        except Exception:
            print("failed to read the gzip file")

Then I do some processing, get some JSON objects, and store them in a list.

    for num, json_file in enumerate(json_list):
        try:
            j_file = json.loads(json_file)
            # (...some code...)
        except Exception:
            print("fail")

My question is: what is the right way to write them back into a .jsonl.gz file?

This is my attempt:

    jsonfilename = 'valid_' + str(num) + '.jsonl.gz'
    with gzip.open(jsonfilename, 'wb') as f:
        for dict in list_of_nested_dictionaries:
            content.append(json.dumps(dict).encode('utf-8'))
        f.write(content)

But I got this error: `TypeError: memoryview: a bytes-like object is required, not 'list'`

Then I tried just to gzip the list of dictionaries as is:

    jsonfilename = 'valid_' + str(num) + '.jsonl.gz'
    with gzip.open(jsonfilename, 'wb') as f:
        f.write(json.dumps(list_of_nested_dictionaries).encode('utf-8'))

But the problem here is that it writes the whole list as one block: when I read it back I get a single element containing the entire stored list, not one JSON object per line as I got in the first step.

This is the code that I use for reading:

    with gzip.open('valid_3.jsonl.gz', "r") as read_file:
        try:
            json_list = list(read_file)  # read the gzip file
            print(len(json_list))  # I get 1 here
        except Exception:
            print("fail")

    json_list[0].decode('utf-8')  # the whole stored list comes back as one string
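
The same behaviour can be reproduced with a tiny hypothetical list: `json.dumps` over the whole list produces a single line with no newlines in it, so iterating over the gzip file yields exactly one element.

    import gzip
    import json

    # hypothetical two-element stand-in for list_of_nested_dictionaries
    demo_list = [{"key1": 1, "key2": {"nested": True}},
                 {"key1": 2, "key2": {"nested": False}}]

    with gzip.open('demo.jsonl.gz', 'wb') as f:
        # one json.dumps call over the whole list -> one long line, no newlines
        f.write(json.dumps(demo_list).encode('utf-8'))

    with gzip.open('demo.jsonl.gz', 'r') as read_file:
        print(len(list(read_file)))  # prints 1
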
  • `json_list = list(read_file)` should probably be `json_list = json.load(read_file)` – jordanm May 01 '20 at 21:12
  • What line is causing the `TypeError: memoryview:`? – martineau May 01 '20 at 21:33
  • No, it is right; it works for me like that, and every element of the list is a dictionary of nested dictionaries – student2020 May 01 '20 at 21:34
  • @martineau it is f.write(content) – student2020 May 01 '20 at 21:35
  • Change the `content.append(json.dumps(dict).encode('utf-8'))` to `f.write(json.dumps(dict).encode('utf-8'))` and remove the `f.write(content)`. Each "line" of a jsonl format file should be a single (and complete) JSON object. – martineau May 01 '20 at 22:53
  • @martineau I tried it; the same problem, nothing changed – student2020 May 01 '20 at 23:25
  • In that case your `list_of_nested_dictionaries` must actually be a list of lists, not what its name implies. You need to edit your question and provide a minimal reproducible example; you're leaving too many details out. – martineau May 02 '20 at 01:18
  • Which details are not clear? Everything is clear; I even provided a solution. By a list of nested dictionaries, I mean something like `[dict1, dict2, dict3, ...]`, where every dict has a structure like `{key1: value1, key2: dict_value, key3: value3}` and `dict_value` is a dict – student2020 May 02 '20 at 06:24
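
A minimal sketch of the per-object `f.write` approach suggested in the comments above (assuming `list_of_nested_dictionaries` and `jsonfilename` from the question); the explicit '\n' after each object is what actually separates the lines, otherwise everything still ends up on a single line:

    with gzip.open(jsonfilename, 'wb') as f:
        for d in list_of_nested_dictionaries:
            # dump one object, terminate it with a newline, write it immediately
            f.write((json.dumps(d) + '\n').encode('utf-8'))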

2 Answers


`f.write(content)` takes a byte-string, but you're passing it a list of byte-strings.

`f.writelines(content)` will iterate over and write each byte-string from the list.
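
For example, a minimal sketch with a hypothetical list of newline-terminated byte-strings:

    import gzip
    import json

    records = [{"a": 1}, {"b": 2}]  # hypothetical stand-in for the question's data
    lines = [(json.dumps(r) + '\n').encode('utf-8') for r in records]

    with gzip.open('example.jsonl.gz', 'wb') as f:
        f.writelines(lines)  # an iterable of byte-strings is fine here
        # f.write(lines)     # passing the list itself raises the TypeError above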

Edit: by the way, gzip is meant for compressing a single file. If you need to compress multiple files into one, I suggest packing them together in a tarball first and then gzipping that.
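
A minimal sketch of that suggestion (the file names here are hypothetical):

    import tarfile

    # pack several already-written files into one gzip-compressed tar archive
    with tarfile.open('valid_files.tar.gz', 'w:gz') as tar:
        for path in ['valid_0.jsonl', 'valid_1.jsonl']:  # hypothetical names
            tar.add(path)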

MarkM
  • Again, I had the same problem when reading the stored list; I need to store it in the same form I read it in the first step. The .jsonl.gz files I read in the first step came from an existing data set which I did not create; it already exists. My purpose is to store the JSON files that are valid for me in the same format I read them in. – student2020 May 01 '20 at 22:03
  • It works well for me before gzipping, but after writing them to .jsonl.gz I could not get back a list of JSON files – student2020 May 01 '20 at 22:06

The solution is simply like this:

    content = []
    with gzip.open(jsonfilename, 'wb') as f:
        for d in list_of_nested_dictionaries:
            # one JSON object per line, each terminated by '\n' (JSON Lines format)
            content.append((json.dumps(d) + '\n').encode('utf-8'))
        f.writelines(content)
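
Reading the file back then gives one JSON object per line again; a minimal check, assuming the file written above:

    import gzip
    import json

    with gzip.open(jsonfilename, 'r') as read_file:
        restored = [json.loads(line) for line in read_file]

    print(len(restored))  # should equal len(list_of_nested_dictionaries)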