Background:
I need to write a CSV file that is compressed before it reaches the disk. I'm running about 96 processes simultaneously on an SMP machine, and uncompressed output would fill the tiny amount of disk space I have before I can offload it elsewhere (no, it's not my system, so don't ask me how a 104 CPU / 0.25TB RAM / 8 Tesla server has only 2TB shared among all users, and that is 90+% full). I need to use as many processors as I can, since 1 CPU would take almost 4 years and using 96 drops that to about 2 weeks.
All of the answers to similar questions state that you should use bz2.open() with mode 'wt'; however, I have not found any that address using the bz2 file-like object with a csv.writer() object, and it just does not seem to work. I've even written a script to test all the possible write mode permutations (see below) that reproduces the problem faithfully.
Note: I cannot simply do ','.join(row), which would otherwise work with a mode='wt' bz2 object, because many of the text fields need escaping: line breaks, embedded commas, embedded '\x00' chars, etc.
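To illustrate the escaping problem, here is a minimal sketch that compares ','.join() with csv.writer on an in-memory buffer. The sample row and the io.StringIO buffer are made up for this illustration and are not my real data.

```python
import csv
import io

# Hypothetical row with an embedded comma, a line break, and a quote.
sample = ["a,b", "line1\nline2", 'say "hi"', "plain"]

# Naive join: the embedded comma and newline become indistinguishable
# from real field and record separators.
naive = ",".join(sample)

# csv.writer quotes the fields that need it and doubles embedded quotes.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(sample)
quoted = buf.getvalue()

# Reading the quoted text back recovers the original four fields.
round_trip = next(csv.reader(io.StringIO(quoted, newline="")))
print(round_trip == sample)
print(len(naive.split(",")))
```

The joined string splits into more pieces than there are fields, while the csv.writer output survives a round trip, which is why I want to keep csv.writer in the pipeline.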
Steps to reproduce:
/tmp/test.py:
import os
import bz2
import csv
import traceback

tfile = '/tmp/test.csv.bz'
row = ['bc22jtr', 118324, None, 'contran', None, 11.5, 9.23, ]


def perr(err, bmode, fmode=None):
    """Func for printing exception info in a less noisy manner."""
    print(
        f"EXCEPTION: wt.writerow(row) → {type(err).__name__}:"
        f" {err}; bmode='{bmode}', fmode='{fmode}'"
    )
    print((''.join(traceback.format_exception(err)[-2:-1])).strip())
    return True


for fmode in ["w", "wt", "wb"]:
    for bmode in ["w", "wb"]:
        had_err = False
        if os.path.exists(tfile):
            os.remove(tfile)
        fh = open(tfile, fmode)
        try:
            bh = bz2.BZ2File(fh, mode=bmode, compresslevel=9)
        except ValueError as err:
            had_err = perr(err, bmode, fmode)

        wt = csv.writer(fh)
        try:
            wt.writerow(row)
        except TypeError as err:
            had_err = perr(err, bmode, fmode)
        try:
            bh.close()
        except TypeError as err:
            had_err = perr(err, bmode, fmode)
        if not had_err:
            print(f"WAS OK: bmode={bmode}, fmode={fmode}")
for bmode in ["w", "wb", "wt"]:
    if os.path.exists(tfile):
        os.remove(tfile)
    bh = bz2.open(fh, mode=bmode, compresslevel=9)
    wt = csv.writer(fh)
    had_err = False
    try:
        wt.writerow(row)
    except TypeError as err:
        had_err = perr(err, bmode)
    try:
        bh.close()
    except TypeError as err:
        had_err = perr(err, bmode)
    if not had_err:
        print(f"WAS OK: bmode={bmode}")
if os.path.exists(tfile):
    os.remove(tfile)
Output:
> python3 /tmp/test.py
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='w', fmode='w'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='wb', fmode='w'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='w', fmode='wt'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='wb', fmode='wt'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='w', fmode='wb'
File "/tmp/test.py", line 33, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='wb', fmode='wb'
File "/tmp/test.py", line 33, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='w', fmode='None'
File "/tmp/test.py", line 49, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='wb', fmode='None'
File "/tmp/test.py", line 49, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='wt', fmode='None'
File "/tmp/test.py", line 49, in <module>
    wt.writerow(row)
Note: bmode='wt' is not tested in the first loop since bz2.BZ2File(fh, mode='wt') will always raise a ValueError: Invalid mode: 'wt' exception.
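The mode difference between the two entry points can be checked in isolation. This is a minimal sketch, assuming an in-memory io.BytesIO target instead of my real file; the buffer and the "hello" payload are only there for illustration.

```python
import bz2
import io

# BZ2File is a binary-only class, so a text mode is rejected up front.
try:
    bz2.BZ2File(io.BytesIO(), mode="wt")
    rejected = False
    message = ""
except ValueError as err:
    rejected = True
    message = str(err)

# bz2.open() accepts "wt" and returns a text layer over the binary stream.
raw = io.BytesIO()
with bz2.open(raw, mode="wt", encoding="utf-8", newline="") as th:
    wrapper_type = type(th).__name__
    th.write("hello\n")

# The bytes that reached the underlying buffer are bz2-compressed text.
payload = bz2.decompress(raw.getvalue())
print(rejected, message, wrapper_type, payload)
```

So only bz2.open() offers a text mode; bz2.BZ2File itself reads and writes bytes.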
Question
How can I write a compressed CSV with proper escaping and encoding on the fly?