
How can I read a gzip file that has JSON content in it and then write that content into a text file?

import gzip
import json

with open('.../notebooks/decompressed.txt', 'wb') as f_out:
    with gzip.open(".../2020-04/statuses.log.2020-04-01-00.gz", 'rb') as f_in:
        data = f_in.read()
        json.dumps(data)

Error: Object of type bytes is not JSON serializable

decompressed.txt (first 2 lines): [screenshot of the file contents, not reproduced here]

  • Does this answer your question? [TypeError: Object of type 'bytes' is not JSON serializable](https://stackoverflow.com/questions/44682018/typeerror-object-of-type-bytes-is-not-json-serializable) – Andrej Podzimek Oct 05 '21 at 15:26
  • I am a little new to this. I tried different calls like json.dump(data, f_out), etc., but nothing is working. – Trishala Suryavanshi Oct 05 '21 at 15:30
  • Have you verified that the gzipped file is a single serialized JSON document, either starting and ending with '[' and ']' or '{' and '}'? – CodeMonkey Oct 05 '21 at 15:33
  • The file contains twitter data with structure like: { "a":value,"b":value} { "a":value,"b":value} { "a":value,"b":value} – Trishala Suryavanshi Oct 05 '21 at 15:36
  • You would use `json.loads`, not `json.dumps` to convert JSON input to a structure. However if you're writing it back out to a file, then you'd need to reserialize it again. So it seems like all you need to do is write `data` to your output file, and not use `json` at all. – Mark Adler Oct 05 '21 at 15:44
  • I tried that: `with open('.../notebooks/decompressed.txt', 'wb') as f_out: with gzip.open(".../2020-04/statuses.log.2020-04-01-00.gz", 'rb') as f_in: f_out.writelines(f_in)` But then when I try to load a JSON object using json.loads, it doesn't accept the format. – Trishala Suryavanshi Oct 05 '21 at 15:48
  • Can you run the cmd: `gunzip -c ../2020-04/statuses.log.2020-04-01-00.gz | tail` to see if that works correctly or is there a problem with the file? – CodeMonkey Oct 05 '21 at 16:06
  • This command is working fine and showing last records of file. – Trishala Suryavanshi Oct 05 '21 at 16:14
  • Is the logfile a single JSON structure or a series of concatenated JSON objects, one per line? If each line is something like *{'a':value}* then it may be the latter, and json.loads() or json.load() won't work on that directly. – CodeMonkey Oct 05 '21 at 16:47
  • Please update the question with a representative snippet of the first and last lines of the decompressed log file. – CodeMonkey Oct 05 '21 at 16:52
  • I am trying to add the first and last line, but it is huge, so Stack Overflow is not allowing me to do that. – Trishala Suryavanshi Oct 05 '21 at 16:58
  • I added a picture for decompressed.txt – Trishala Suryavanshi Oct 05 '21 at 17:10
  • Looks like file is a series of concatenated json objects. Try *for line in fin: obj = json.loads(line)*. Updated answer to address this issue. – CodeMonkey Oct 05 '21 at 17:35
  • Thank you! The code worked!! – Trishala Suryavanshi Oct 05 '21 at 19:58

1 Answer


If the log content is already in serialized JSON format, then you just need to write the decompressed data as-is.

import gzip
with gzip.open('.../2020-04/statuses.log.2020-04-01-00.gz', 'rb') as fin:
    with open('.../notebooks/decompressed.txt', 'wb') as fout:
        data = fin.read()
        fout.write(data)

If the file is huge, then import the shutil module and replace the read() and write() calls with:

shutil.copyfileobj(fin, fout)
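As a minimal, self-contained sketch of that streaming copy, here is the same idea run against a tiny stand-in gzip file built in a temp directory (the paths and sample data are made up for illustration, not the asker's real files):

```python
import gzip
import os
import shutil
import tempfile

# Build a tiny stand-in gzip log (two JSON lines) in a temp directory.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "statuses.log.gz")
txt_path = os.path.join(tmpdir, "decompressed.txt")

with gzip.open(gz_path, "wb") as f:
    f.write(b'{"a": 1}\n{"a": 2}\n')

# Stream-decompress in chunks; no full read() into memory.
with gzip.open(gz_path, "rb") as fin, open(txt_path, "wb") as fout:
    shutil.copyfileobj(fin, fout)

with open(txt_path, "rb") as f:
    print(f.read())  # b'{"a": 1}\n{"a": 2}\n'
```

copyfileobj copies in fixed-size chunks (64 KB by default), so memory use stays flat no matter how large the log is.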

If you want to load the JSON into an object and reserialize it, then:

import gzip
import json

with gzip.open('.../2020-04/statuses.log.2020-04-01-00.gz', 'rb') as fin:
    with open('.../notebooks/decompressed.txt', 'w') as fout:
        obj = json.load(fin)
        json.dump(obj, fout)
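For illustration, json.load can read straight from the binary gzip stream (json.loads accepts bytes since Python 3.6). This sketch simulates the gzipped data in memory rather than reading the asker's file, and the sample document is made up:

```python
import gzip
import io
import json

# Simulate a gzipped file holding ONE JSON document (made-up data).
buf = io.BytesIO(gzip.compress(b'{"a": 1, "b": [2, 3]}'))

with gzip.open(buf, "rb") as fin:
    obj = json.load(fin)  # works because the whole file is a single JSON value

print(obj)  # {'a': 1, 'b': [2, 3]}
```

Note this approach only works when the entire file is one JSON value; it fails with "Extra data" if a second object follows the first.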

If the log file is a series of JSON structures, one per line (JSON Lines), then try:

import gzip
with gzip.open('.../2020-04/statuses.log.2020-04-01-00.gz', 'rb') as fin:
    for line in fin:
        obj = json.loads(line)
        # next do something with obj
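Here is a runnable sketch of that per-line approach, again with in-memory sample data standing in for the real log:

```python
import gzip
import io
import json

# Three concatenated JSON objects, one per line (JSON Lines), gzipped in memory.
raw = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
buf = io.BytesIO(gzip.compress(raw))

objs = []
with gzip.open(buf, "rb") as fin:
    for line in fin:  # iterating a gzip file yields decompressed lines
        objs.append(json.loads(line))

print(objs)  # [{'a': 1}, {'a': 2}, {'a': 3}]
```

Because only one line is held in memory at a time, this also sidesteps the MemoryError that a whole-file read() can cause.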

If the JSON is too large to deserialize at once, then try the ijson module, which can iterate over huge JSON structures incrementally.

CodeMonkey
  • f_out.write(data) is giving error: a bytes-like object is required, not 'str' – Trishala Suryavanshi Oct 05 '21 at 15:56
  • Updated code in answer. If you use `fin.read().decode()`, then use "w" mode on the output for string content; otherwise, skip `decode()` on the input data and use "wb" on the output for binary data. – CodeMonkey Oct 05 '21 at 15:57
  • Updated code is giving this error: Extra data: line 2 column 1 (char 1994) at line obj = json.loads(json_data) – Trishala Suryavanshi Oct 05 '21 at 16:00
  • If I use shutil.copyfileobj(f_in, f_out) then how would load JSON object while reading the file? – Trishala Suryavanshi Oct 05 '21 at 16:17
  • If `json.loads(json_data)` is failing, then the data is not valid JSON. You need to print the data to see what it is. – CodeMonkey Oct 05 '21 at 16:21
  • I understood that I can create a decompressed.txt file using shutil.copyfileobj(f_in, f_out). But in the future, when I want to read data from the decompressed.txt file, will json.loads work? – Trishala Suryavanshi Oct 05 '21 at 16:22
  • Try *obj = json.load(fin)* when loading *decompressed_twitter_lot1.txt* input file. – CodeMonkey Oct 05 '21 at 16:34
  • `json_data = fin.read().decode()` then `obj = json.loads(json_data)`. I tried this after creating decompressed.txt and opening the file again. It is giving a MemoryError with no description. And `obj = json.load(fin)` is also giving a memory error. – Trishala Suryavanshi Oct 05 '21 at 16:35
  • There are modules to stream json files. see [ijson](https://pypi.org/project/ijson/); see also [this](https://stackoverflow.com/questions/10382253/reading-rather-large-json-files). – CodeMonkey Oct 05 '21 at 16:37
  • Then I should be creating file with Ijson? – Trishala Suryavanshi Oct 05 '21 at 16:48
  • ijson is for parsing huge JSON, assuming the file is a single JSON structure, not a series of concatenated JSON objects. Can't say much w/o seeing an example of the first and last lines of the decompressed input file. – CodeMonkey Oct 05 '21 at 16:48