I want to process quite big ARFF files in scikit-learn. The files are in a zip archive and I do not want to unpack the archive to a folder before processing. Hence, I use the zipfile module of Python 3.6:
from zipfile import ZipFile
from scipy.io.arff import loadarff
archive = ZipFile( 'archive.zip', 'r' )
datafile = archive.open( 'datafile.arff' )
data = loadarff( datafile )
# …
datafile.close()
archive.close()
However, this yields the following error:
Traceback (most recent call last):
File "./m.py", line 6, in <module>
data = loadarff( datafile )
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 541, in loadarff
return _loadarff(ofile)
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 550, in _loadarff
rel, attr = read_header(ofile)
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 323, in read_header
while r_comment.match(i):
TypeError: cannot use a string pattern on a bytes-like object
According to loadarff documentation, loadarff
requires a file-like object.
According to zipfile documentation, open
returns a file-like ZipExtFile
.
Hence, my question is how to use what ZipFile.open
returns as the ARFF input to loadarff
.
Note: If I unzip manually and load the ARFF directly with data = loadarff( 'datafile.arff' )
, all is fine.