
I have a folder full of JSON files (~10 GB, mostly text data) that have been compressed with gzip. I currently have code that works, but it's painfully slow (think several hours):

import pandas as pd
from glob import glob

filenames = glob('folder_path/*')
dataframes = [pd.read_json(f, compression='gzip') for f in filenames]

I'm hoping to find a quicker way to unzip all the files and save each one to a pandas df, or all of them to a single df (the 1 vs. many dfs doesn't matter to me at this point). I've read about zlib, but that doesn't seem to work for gzip files? I've tried a few different things there too, but none seem to work, like:

filenames = glob('folder_path/*')
jsonobjs = [gzip.open(f, 'rb') for f in filenames]

returns:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-12-a5a84131bb38> in <module>
      1 filenames = glob('folder_path/*')
----> 2 jsonobjs = [gzip.open(f, 'rb') for f in filenames]

<ipython-input-12-a5a84131bb38> in <listcomp>(.0)
      1 filenames = glob('folder_path/*')
----> 2 jsonobjs = [gzip.open(f, 'rb') for f in filenames]

~/anaconda3/lib/python3.7/gzip.py in open(filename, mode, compresslevel, encoding, errors, newline)
     51     gz_mode = mode.replace("t", "")
     52     if isinstance(filename, (str, bytes, os.PathLike)):
---> 53         binary_file = GzipFile(filename, gz_mode, compresslevel)
     54     elif hasattr(filename, "read") or hasattr(filename, "write"):
     55         binary_file = GzipFile(None, gz_mode, compresslevel, filename)

~/anaconda3/lib/python3.7/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
    161             mode += 'b'
    162         if fileobj is None:
--> 163             fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    164         if filename is None:
    165             filename = getattr(fileobj, 'name', '')

OSError: [Errno 24] Too many open files: 'folder_path/d2ea1c35275b495fb73cb123cdf4fe4c'

and

with gzip.open(glob('folder_path/*'), 'rb') as f:
    file_content = f.read()

returns:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-bd68570238cd> in <module>
----> 1 with gzip.open(glob('folder_path/*'), 'rb') as f:
      2         file_content = f.read()

TypeError: 'module' object is not callable

So this:

with gzip.open('single_file', 'rb') as f:
    file_content = f.read()
pd.read_json(file_content)

works just fine, and is faster than passing compression='gzip' to pd.read_json, but I don't know how to get that to work for all the files.

Edit: I tried the following:

for file_name in glob('folder_path/*'):
    with [gzip.open(f, 'rb') for f in filenames]:
        file_name = pd.read_json(f)

but that returns the same "too many open files" error.

LMGagne
  • maybe try dask? https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_json – Ben Pap Jan 06 '20 at 22:16
  • I just ran a quick test using `pd.read_json` with some randomly generated, gzipped JSON files. 1000 files, 40 kb of zipped data per file, so 40 MB total data. The unzipped size was 120kb each. Each file was increasing the python memory footprint by 5-6 MB as it was read in with `pd.read_json`. So this is using a massive amount of memory if you are going through 10 GB of files all at once. How much memory do you have? I'm surprised this is ever completing for you. What is the end goal of the data? Does it all need to be handled at once? – totalhack Jan 06 '20 at 22:40
  • @totalhack It does all need to be handled at once. Currently it's taking about a full day to open all the files and I actually have 4 folders of these files (so ~40GB total) that eventually all need to be opened, transformed, and combined. I'm looking into a multicore or distributed solution for this, but I don't have much experience actually implementing that. – LMGagne Jan 07 '20 at 15:36
  • If the memory usage in my example holds even remotely close for you, you will need a few TB of memory to hold this all at once. And you may need a multiple of that depending on the operations you want to do to the data once you have it in memory. If you could provide more detail on what you are trying to do with the data that would be helpful. Do you have to use pandas? What went wrong when you tried using gzip.open? The fact that you are ok having multiple DFs suggest to me you might not actually need this all in memory at once. – totalhack Jan 07 '20 at 16:24
  • @totalhack edited the post to include more tracebacks. If I found a way to open and transform the files quickly, but that resulted in multiple dfs I would have to combine them as the next step. – LMGagne Jan 07 '20 at 19:17
  • First traceback is you running into a file system limitation. Second one is likely a syntax/user error...at a minimum the code in the traceback doesn't match the code you gave as an example above it, the path is different. Looking quickly at docs it doesn't look like gzip.open can take an iterable. Fixing that should allow you to read the file contents in a loop that closes the file handle on each iteration. What do you intend to do with the data once its all in memory? – totalhack Jan 07 '20 at 19:26
  • @totalhack the mismatch was a copy/paste error, I've fixed it now to match. I'm not sure how to use gzip.open to iterate through all the files except the ways I've already tried. I need to run the data through an LDA model. – LMGagne Jan 07 '20 at 19:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/205554/discussion-between-totalhack-and-lmgagne). – totalhack Jan 07 '20 at 20:01

1 Answer


I've walked the OP through some changes to address the preliminary tracebacks, which should allow them to get this process working on a smaller data set. However, the real issue is that the data set is too large to load into memory all at once. Since the goal is to train an LDA model, I've suggested the OP look into libraries that support online learning, so the model can be built up incrementally without an impossible memory footprint.
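
As a rough illustration of that suggestion, here is a minimal sketch of online training with scikit-learn's LatentDirichletAllocation and a stateless HashingVectorizer. It assumes each file holds a JSON array of records with a text field named 'text'; the field name, n_features, and n_components are placeholders, not taken from the OP's data:

import gzip
import json
from glob import glob

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stateless vectorizer, so it never needs to see the whole corpus at once.
# alternate_sign=False keeps the hashed counts non-negative, which LDA requires.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
lda = LatentDirichletAllocation(n_components=20, learning_method='online')

for fname in glob('folder_path/*gz'):
    with gzip.open(fname, 'rt') as f:
        records = json.load(f)  # assumes each file is a JSON array of records
    texts = [r['text'] for r in records]  # 'text' is a hypothetical field name
    X = vectorizer.transform(texts)
    lda.partial_fit(X)  # update the model one file at a time

This way only one file's worth of data is ever in memory at a time.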

This isn't an answer to the more general topic of "Unzipping multiple json files from folder into pandas df", but that wasn't really the main question. The following (untested) code could loop over the gzipped files in a folder and read each one into a dataframe. Then either concatenate or process those dataframes as necessary.

from glob import glob
import gzip

import pandas as pd

for fname in glob('folder_path/*gz'):
    with gzip.open(fname, 'rb') as f:
        df = pd.read_json(f)
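
If the per-file dataframes do need to end up combined, a sketch of one way to do that, assuming everything fits in memory, is to collect them in a list and concatenate once at the end:

dfs = []
for fname in glob('folder_path/*gz'):
    with gzip.open(fname, 'rb') as f:
        dfs.append(pd.read_json(f))

# One dataframe for everything; only practical if it all fits in memory.
combined = pd.concat(dfs, ignore_index=True)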

Note that it's pretty slow to do this over many files with pandas. You'd likely be better off reading and parsing the raw JSON structures, cleaning/transforming them as necessary, and then forming a final pandas dataframe from all the combined data (or chunks of it). Or avoid pandas altogether if it isn't truly necessary.
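
As a rough sketch of that alternative, assuming each file holds a JSON array of records (adjust the parsing to the actual structure of the files):

import gzip
import json
from glob import glob

import pandas as pd

records = []
for fname in glob('folder_path/*gz'):
    with gzip.open(fname, 'rt') as f:
        # Parse the raw JSON and accumulate plain dicts instead of
        # building a dataframe per file.
        records.extend(json.load(f))

# Build the dataframe once, from the combined records.
df = pd.DataFrame.from_records(records)

This still needs enough memory for all of the parsed records, so for the full 10 GB it would have to be done in chunks, or fed into the online-learning approach described above.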

totalhack