
How can I read a gzip file that has JSON content in it and then write that content into a text file?

import gzip
import json

with open('.../notebooks/decompressed.txt', 'wb') as f_out:
    with gzip.open(".../2020-04/statuses.log.2020-04-01-00.gz", 'rb') as f_in:
        data = f_in.read()
        json.dumps(data)

Error: Object of type bytes is not JSON serializable

decompressed.txt (first 2 lines): [screenshot of the file contents, not reproduced here]

  • Does this answer your question? [TypeError: Object of type 'bytes' is not JSON serializable](https://stackoverflow.com/questions/44682018/typeerror-object-of-type-bytes-is-not-json-serializable) – Andrej Podzimek Oct 05 '21 at 15:26
  • I am a little new to this. I tried different calls like json.dump(data, f_out), etc., but nothing is working. – Trishala Suryavanshi Oct 05 '21 at 15:30
  • Have you verified that the gzipped file is a single serialized JSON document, either starting and ending with '[' and ']' or '{' and '}'? – CodeMonkey Oct 05 '21 at 15:33
  • The file contains twitter data with structure like: { "a":value,"b":value} { "a":value,"b":value} { "a":value,"b":value} – Trishala Suryavanshi Oct 05 '21 at 15:36
  • You would use `json.loads`, not `json.dumps` to convert JSON input to a structure. However if you're writing it back out to a file, then you'd need to reserialize it again. So it seems like all you need to do is write `data` to your output file, and not use `json` at all. – Mark Adler Oct 05 '21 at 15:44
  • I tried that: `with open('.../notebooks/decompressed.txt', 'wb') as f_out: with gzip.open(".../2020-04/statuses.log.2020-04-01-00.gz", 'rb') as f_in: f_out.writelines(f_in)` But then when I try to load a JSON object using json.loads, it doesn't accept the format. – Trishala Suryavanshi Oct 05 '21 at 15:48
  • Can you run the cmd: `gunzip -c ../2020-04/statuses.log.2020-04-01-00.gz | tail` to see if that works correctly or is there a problem with the file? – CodeMonkey Oct 05 '21 at 16:06
  • This command is working fine and showing last records of file. – Trishala Suryavanshi Oct 05 '21 at 16:14
  • Is the logfile a single JSON structure or a series of concatenated JSON objects, one per line? If each line is something like *{'a':value}* then it may be the latter, and json.loads() or json.load() won't work on that directly. – CodeMonkey Oct 05 '21 at 16:47
  • Please update the question with a representative snippet of the first and last lines of the decompressed log file. – CodeMonkey Oct 05 '21 at 16:52
  • I am trying to add the first and last line, but it is huge, so Stack Overflow is not allowing me to do that. – Trishala Suryavanshi Oct 05 '21 at 16:58
  • I added a picture for decompressed.txt – Trishala Suryavanshi Oct 05 '21 at 17:10
  • Looks like file is a series of concatenated json objects. Try *for line in fin: obj = json.loads(line)*. Updated answer to address this issue. – CodeMonkey Oct 05 '21 at 17:35
  • Thank you! The code worked!! – Trishala Suryavanshi Oct 05 '21 at 19:58

1 Answer


If the log content is already in serialized JSON format, then you just need to write the decompressed data as-is.

import gzip
with gzip.open('.../2020-04/statuses.log.2020-04-01-00.gz', 'rb') as fin:
    with open('.../notebooks/decompressed.txt', 'wb') as fout:
        data = fin.read()
        fout.write(data)

If the file is huge, then import the shutil module and replace the read() and write() calls with:

shutil.copyfileobj(fin, fout)
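As a minimal, self-contained sketch of that streaming copy, here is the same idea run against a tiny stand-in gzip file built in a temp directory (the paths and sample data are made up for illustration, not the asker's real files):

```python
import gzip
import os
import shutil
import tempfile

# Build a tiny stand-in gzip log (two JSON lines) in a temp directory.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "statuses.log.gz")
txt_path = os.path.join(tmpdir, "decompressed.txt")

with gzip.open(gz_path, "wb") as f:
    f.write(b'{"a": 1}\n{"a": 2}\n')

# Stream-decompress in chunks; no full read() into memory.
with gzip.open(gz_path, "rb") as fin, open(txt_path, "wb") as fout:
    shutil.copyfileobj(fin, fout)

with open(txt_path, "rb") as f:
    print(f.read())  # b'{"a": 1}\n{"a": 2}\n'
```

copyfileobj copies in fixed-size chunks (64 KB by default), so memory use stays flat no matter how large the log is.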

If you want to load the JSON into an object and reserialize it, then:

import gzip
import json

with gzip.open('.../2020-04/statuses.log.2020-04-01-00.gz', 'rb') as fin:
    with open('.../notebooks/decompressed.txt', 'w') as fout:
        obj = json.load(fin)
        json.dump(obj, fout)
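For illustration, json.load can read straight from the binary gzip stream (json.loads accepts bytes since Python 3.6). This sketch simulates the gzipped data in memory rather than reading the asker's file, and the sample document is made up:

```python
import gzip
import io
import json

# Simulate a gzipped file holding ONE JSON document (made-up data).
buf = io.BytesIO(gzip.compress(b'{"a": 1, "b": [2, 3]}'))

with gzip.open(buf, "rb") as fin:
    obj = json.load(fin)  # works because the whole file is a single JSON value

print(obj)  # {'a': 1, 'b': [2, 3]}
```

Note this approach only works when the entire file is one JSON value; it fails with "Extra data" if a second object follows the first.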

If the log file is a series of JSON structures, one per line (JSON Lines), then try:

import gzip
with gzip.open('.../2020-04/statuses.log.2020-04-01-00.gz', 'rb') as fin:
    for line in fin:
        obj = json.loads(line)
        # next do something with obj
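Here is a runnable sketch of that per-line approach, again with in-memory sample data standing in for the real log:

```python
import gzip
import io
import json

# Three concatenated JSON objects, one per line (JSON Lines), gzipped in memory.
raw = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
buf = io.BytesIO(gzip.compress(raw))

objs = []
with gzip.open(buf, "rb") as fin:
    for line in fin:  # iterating a gzip file yields decompressed lines
        objs.append(json.loads(line))

print(objs)  # [{'a': 1}, {'a': 2}, {'a': 3}]
```

Because only one line is held in memory at a time, this also sidesteps the MemoryError that a whole-file read() can cause.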

If the JSON is too large to deserialize at once, then try the ijson module, which can iterate over huge JSON structures incrementally.

CodeMonkey
  • f_out.write(data) is giving error: a bytes-like object is required, not 'str' – Trishala Suryavanshi Oct 05 '21 at 15:56
  • Updated code in answer. If you use `fin.read().decode()`, then use "w" mode on the output for string content; otherwise, skip `decode()` on the input data and use "wb" on the output for binary data. – CodeMonkey Oct 05 '21 at 15:57
  • Updated code is giving this error: Extra data: line 2 column 1 (char 1994) at line obj = json.loads(json_data) – Trishala Suryavanshi Oct 05 '21 at 16:00
  • If I use shutil.copyfileobj(f_in, f_out) then how would load JSON object while reading the file? – Trishala Suryavanshi Oct 05 '21 at 16:17
  • If `json.loads(json_data)` is failing, then the data is not valid JSON. You need to print the data to see what it is. – CodeMonkey Oct 05 '21 at 16:21
  • I understood that I can create a decompressed.txt file using shutil.copyfileobj(f_in, f_out). But in the future, when I want to read data from the decompressed.txt file, will json.loads work? – Trishala Suryavanshi Oct 05 '21 at 16:22
  • Try *obj = json.load(fin)* when loading *decompressed_twitter_lot1.txt* input file. – CodeMonkey Oct 05 '21 at 16:34
  • `json_data = fin.read().decode()` then `obj = json.loads(json_data)`. I tried this after creating decompressed.txt and opening the file again. It is giving a MemoryError with no description. And `obj = json.load(fin)` is also giving a memory error. – Trishala Suryavanshi Oct 05 '21 at 16:35
  • There are modules to stream json files. see [ijson](https://pypi.org/project/ijson/); see also [this](https://stackoverflow.com/questions/10382253/reading-rather-large-json-files). – CodeMonkey Oct 05 '21 at 16:37
  • Then I should be creating file with Ijson? – Trishala Suryavanshi Oct 05 '21 at 16:48
  • ijson is for parsing huge JSON, assuming the file is a single JSON structure, not a series of concatenated JSON objects. Can't say much w/o seeing an example of the first and last lines of the decompressed input file. – CodeMonkey Oct 05 '21 at 16:48