I have a folder full of JSON files (~10 GB, mostly text data) that have been compressed with gzip. I currently have code that works, but it is painfully slow (think several hours):
import gzip  # used in the attempts further down
from glob import glob
import pandas as pd

filenames = glob('folder_path/*')
dataframes = [pd.read_json(f, compression='gzip') for f in filenames]
I'm hoping to find a quicker way to decompress all the files and load each one into its own pandas df, or all of them into a single df (one df vs. many doesn't matter to me at this point). I've read about zlib, but that doesn't seem to work for gzip files? I've tried a few different things there too, but none of them work, like:
filenames = glob('folder_path/*')
jsonobjs = [gzip.open(f, 'rb') for f in filenames]
returns:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-12-a5a84131bb38> in <module>
1 filenames = glob('folder_path/*')
----> 2 jsonobjs = [gzip.open(f, 'rb') for f in filenames]
<ipython-input-12-a5a84131bb38> in <listcomp>(.0)
1 filenames = glob('folder_path/*')
----> 2 jsonobjs = [gzip.open(f, 'rb') for f in filenames]
~/anaconda3/lib/python3.7/gzip.py in open(filename, mode, compresslevel, encoding, errors, newline)
51 gz_mode = mode.replace("t", "")
52 if isinstance(filename, (str, bytes, os.PathLike)):
---> 53 binary_file = GzipFile(filename, gz_mode, compresslevel)
54 elif hasattr(filename, "read") or hasattr(filename, "write"):
55 binary_file = GzipFile(None, gz_mode, compresslevel, filename)
~/anaconda3/lib/python3.7/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
161 mode += 'b'
162 if fileobj is None:
--> 163 fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
164 if filename is None:
165 filename = getattr(fileobj, 'name', '')
OSError: [Errno 24] Too many open files: 'folder_path/d2ea1c35275b495fb73cb123cdf4fe4c'
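From what I can tell, the list comprehension opens every file at once and keeps all the handles open, which blows past the per-process limit on open file descriptors. For reference, I believe the current limit can be checked with the resource module (Unix only, as far as I know):

import resource

# (soft, hard) limits on the number of open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)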
I also tried:
with gzip.open(glob('folder_path/*'), 'rb') as f:
    file_content = f.read()
returns:
TypeError Traceback (most recent call last)
<ipython-input-10-bd68570238cd> in <module>
----> 1 with gzip.open(glob('folder_path/*'), 'rb') as f:
2 file_content = f.read()
TypeError: 'module' object is not callable
Either way, I assume gzip.open wants a single path rather than a list. So this:
with gzip.open('single_file', 'rb') as f:
    file_content = f.read()

pd.read_json(file_content)
works just fine and is faster than passing compression='gzip' to pd.read_json, but I don't know how to extend it to all the files.
edit: I tried the following:
for file_name in glob('folder_path/*'):
    with [gzip.open(f, 'rb') for f in filenames]:
        file_name = pd.read_json(f)
but that returns the same "too many open files" error.
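I assume the fix is to open and close one file per loop iteration rather than holding them all open at once, something like the sketch below (untested, and I don't know whether it's actually any faster; 'folder_path' is a placeholder as above). Is that the right approach, or is there a quicker way?

import gzip
from glob import glob
import pandas as pd

dataframes = []
for path in glob('folder_path/*'):
    # open, read, and close one file per iteration so handles don't accumulate
    with gzip.open(path, 'rb') as f:
        dataframes.append(pd.read_json(f))

# optionally combine into a single df (assumes the files share a schema)
df = pd.concat(dataframes, ignore_index=True)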