I want to process quite big ARFF files in scikit-learn. The files are in a zip archive and I do not want to unpack the archive to a folder before processing. Hence, I use the zipfile module of Python 3.6:

from zipfile import ZipFile
from scipy.io.arff import loadarff

archive = ZipFile( 'archive.zip', 'r' )
datafile = archive.open( 'datafile.arff' )
data = loadarff( datafile )
# …
datafile.close()
archive.close()

However, this yields the following error:

Traceback (most recent call last):
  File "./m.py", line 6, in <module>
    data = loadarff( datafile )
  File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 541, in loadarff
    return _loadarff(ofile)
  File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 550, in _loadarff
    rel, attr = read_header(ofile)
  File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 323, in read_header
    while r_comment.match(i):
TypeError: cannot use a string pattern on a bytes-like object

According to the loadarff documentation, loadarff requires a file-like object. According to the zipfile documentation, ZipFile.open returns a file-like ZipExtFile.

Hence, my question is how to use what ZipFile.open returns as the ARFF input to loadarff.

Note: If I unzip manually and load the ARFF directly with data = loadarff( 'datafile.arff' ), all is fine.

Dharman
  • loadarff requires a file-like object, so you should read the data into an in-memory file-like object. Can you try this? `in_mem_fo = StringIO(archive.read('datafile.arff'))` – Nihal Sangeeth Mar 19 '19 at 07:54
  • This yields the error: `File "m.py", line 7, in <module>`, `in_mem_fo = StringIO(archive.read('datafile.arff'))`, `TypeError: initial_value must be str or None, not bytes` – Bernhard Bodenstorfer Mar 19 '19 at 07:59
  • But your idea let me find a solution: `in_mem_fo = StringIO(archive.read('datafile.arff').decode("utf-8"))` or `in_mem_fo = StringIO(archive.read('datafile.arff').decode("ascii"))` – Bernhard Bodenstorfer Mar 19 '19 at 08:04
  • Great. I have added an answer which might be a better solution. – Nihal Sangeeth Mar 19 '19 at 08:10

1 Answer

from io import BytesIO, TextIOWrapper
from zipfile import ZipFile
from scipy.io.arff import loadarff

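# ZipFile.read() returns the whole member as bytes; the archive is closed afterwards
with ZipFile('archive.zip', 'r') as zfile:
    raw_bytes = zfile.read('datafile.arff')
# Wrap the bytes in a text stream so that loadarff receives str lines
in_mem_fo = TextIOWrapper(BytesIO(raw_bytes), encoding='utf-8')
data = loadarff(in_mem_fo)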

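Read the archive member into an in-memory BytesIO object with zfile.read(), wrap it in a TextIOWrapper with encoding='utf-8', and pass the resulting text stream to loadarff. The wrapper is needed because the zip member is delivered as bytes, while loadarff matches str patterns against each line.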
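To illustrate why the text wrapper matters, here is a small standard-library-only sketch. It does not call scipy; the archive contents, the member name, and the `comment` pattern are made up for the demonstration, with the pattern standing in for a str regex such as the `r_comment` seen in the traceback.

```python
import io
import re
import zipfile

# Build a tiny archive in memory (contents are invented for this demo).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("datafile.arff", "% a comment line\n@relation demo\n")

# A str pattern, used here as a stand-in for the one in the traceback.
comment = re.compile(r"^%")

with zipfile.ZipFile(buf, "r") as zf:
    # ZipFile.open() yields a binary stream: lines come back as bytes.
    with zf.open("datafile.arff") as raw:
        raw_line = raw.readline()
    # Wrapping the same kind of stream in TextIOWrapper decodes lines to str.
    with io.TextIOWrapper(zf.open("datafile.arff"), encoding="utf-8") as text:
        text_line = text.readline()

# Matching a str pattern against bytes raises TypeError.
try:
    comment.match(raw_line)
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

print(type(raw_line).__name__, type(text_line).__name__, bytes_rejected)
```

The decoded line matches the pattern without error, which is the situation loadarff needs.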

Edit: It turns out that zfile.open() already returns a (binary) file-like object, so the same result can be achieved without reading the whole member into memory first:

zfile = ZipFile('archive.zip', 'r')
in_mem_fo = TextIOWrapper(zfile.open('datafile.arff'), encoding='ascii')
data = loadarff(in_mem_fo)

Thanks @Bernhard
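The streaming variant can also be wrapped so that the archive and the member are closed reliably. The following is only a sketch: `open_zip_text` is a hypothetical helper name, not part of zipfile or scipy, and the default encoding is an assumption that should be adjusted to the actual file.

```python
import io
import zipfile
from contextlib import contextmanager


@contextmanager
def open_zip_text(archive_path, member, encoding="utf-8"):
    """Yield a text-mode stream for one member of a zip archive.

    Hypothetical helper: nothing is extracted to disk, and the member is
    decoded lazily as the consumer reads from the stream.
    """
    with zipfile.ZipFile(archive_path, "r") as archive:
        # archive.open() gives a binary stream; TextIOWrapper decodes it.
        with io.TextIOWrapper(archive.open(member), encoding=encoding) as text:
            yield text
```

With scipy installed, usage would look like `with open_zip_text('archive.zip', 'datafile.arff', encoding='ascii') as fh: data, meta = loadarff(fh)`. Note that loadarff returns a `(data, meta)` tuple, and the call has to happen inside the `with` block while the stream is still open.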

Nihal Sangeeth
  • Thanks, once again, your answer has inspired me to another solution which I find even more elegant, because it avoids putting all in memory first: `textfile = TextIOWrapper(datafile, encoding='ascii')` and then `data = loadarff( textfile )`. I suggest you put something like this into your solution as an edit and I accept so others can use it. – Bernhard Bodenstorfer Mar 19 '19 at 08:27