
I have about 5,000 .gzip files (~1 MB each). Each file contains data in JSON Lines format. Here's what it looks like:

{"category_id":39,"app_id":12731}
{"category_id":45,"app_id":12713}
{"category_id":6014,"app_id":13567}

I want to parse these files and convert them to a pandas DataFrame. Is there a way to speed up this process? Here's my code, but it's quite slow (about 0.5 s per file):

import pandas as pd
import jsonlines
import gzip
import os
import io


path = 'data/apps/'
files = os.listdir(path)

result = []
for n, file in enumerate(files):
    print(n, file)
    # Read the whole compressed file into memory
    with open(f'{path}/{file}', 'rb') as f:
        data = f.read()

    # Decompress it all at once
    unzipped_data = gzip.decompress(data)

    # Wrap the decompressed bytes so jsonlines can iterate line by line
    decoded_data = io.BytesIO(unzipped_data)
    reader = jsonlines.Reader(decoded_data)

    for line in reader:
        if line['category_id'] == 6014:
            result.append(line)


df = pd.DataFrame(result)

1 Answer


gzip.open gives you a file-like object that decompresses on the fly, so you can read and filter each line without holding the whole decompressed file in memory first.

import pandas as pd
import json
import gzip
import os


path = 'data/apps/'
files = os.listdir(path)

result = []
for n, file in enumerate(files):
    print(n, file)
    # gzip.open streams the decompressed data, so lines are read and
    # filtered on the fly instead of decompressing the whole file up front
    with gzip.open(f'{path}/{file}') as f:
        for line in f:
            data = json.loads(line)
            if data['category_id'] == 6014:
                result.append(data)


df = pd.DataFrame(result)
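
Since the files are independent, the work also parallelizes well across processes. Here's a minimal sketch using multiprocessing (the path and the category filter are taken from your example; the parse_file worker name is just for illustration):

import gzip
import json
import os
from multiprocessing import Pool

import pandas as pd

path = 'data/apps/'


def parse_file(file):
    # Decompress one file and keep only the matching records
    rows = []
    with gzip.open(os.path.join(path, file), 'rt') as f:
        for line in f:
            data = json.loads(line)
            if data['category_id'] == 6014:
                rows.append(data)
    return rows


if __name__ == '__main__':
    files = os.listdir(path)
    with Pool() as pool:
        # One task per file; each worker returns its filtered rows
        results = pool.map(parse_file, files)
    df = pd.DataFrame([row for rows in results for row in rows])

If you don't need to filter at all, pandas can also read a gzipped JSON Lines file directly with pd.read_json(filename, lines=True, compression='gzip'), but filtering before building the DataFrame keeps memory usage down.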