
I am having trouble with basic IO in mxnet. I am attempting to use mxnet.io.NDArrayIter to read an in-memory dataset for training. I have the code below (condensed for brevity), which preprocesses the data and attempts to iterate through it (heavily based on the tutorial):

import csv
import mxnet as mx
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline


with open('data.csv', 'r') as data_file:
    data = list(csv.reader(data_file))

labels = np.array([row[1] for row in data])  # one-hot encoded classes
data = [row[0] for row in data]  # raw text in need of preprocessing

transformer = Pipeline(steps=(('count_vectorizer', CountVectorizer()),
                              ('tfidf_transformer', TfidfTransformer())))

preprocessed_data = np.array([np.array(row) for row in transformer.fit_transform(data)])

training_data = mx.io.NDArrayIter(data=preprocessed_data, label=labels, batch_size=50)

for i, batch in enumerate(training_data):
    print(batch)

When executing this code, I receive the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/mxnet/io.py", line 510, in _init_data
    data[k] = array(v)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/ndarray/utils.py", line 146, in array
    return _array(source_array, ctx=ctx, dtype=dtype)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/ndarray/ndarray.py", line 2245, in array
    arr[:] = source_array
  File "/usr/local/lib/python3.5/dist-packages/mxnet/ndarray/ndarray.py", line 437, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/ndarray/ndarray.py", line 698, in _set_nd_basic_indexing
    self._sync_copyfrom(value)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/ndarray/ndarray.py", line 856, in _sync_copyfrom
    source_array = np.ascontiguousarray(source_array, dtype=self.dtype)
  File "/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py", line 581, in ascontiguousarray
    return array(a, dtype, copy=False, order='C', ndmin=1)
TypeError: float() argument must be a string or a number, not 'csr_matrix'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mxnet_test.py", line 20, in <module>
    training_data = mx.io.NDArrayIter(data=preprocessed_data, label=labels, batch_size=50)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/io.py", line 643, in __init__
    self.data = _init_data(data, allow_empty=False, default_name=data_name)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/io.py", line 513, in _init_data
    "should be NDArray, numpy.ndarray or h5py.Dataset")
TypeError: Invalid type '<class 'numpy.ndarray'>' for data, should be NDArray, numpy.ndarray or h5py.Dataset

I do not understand this error, as my data appears to be converted to a numpy.ndarray before the NDArrayIter instance is created. Would someone be willing to provide some insight into how to read data in mxnet?
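
A minimal check of what that conversion actually produces (using the same transformer and data as above; variable names are illustrative) turns out to be revealing:

X = transformer.fit_transform(data)
print(type(X))  # <class 'scipy.sparse.csr_matrix'>, not a numpy.ndarray

converted = np.array([np.array(row) for row in X])
# The outer array is an object-dtype wrapper, not a 2-d numeric array;
# each element still wraps a sparse row of the csr_matrix
print(converted.dtype, converted.shape)  # object (n_samples,)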

The code above is currently using the following versions:

  • mxnet-1.1.0
  • numpy-1.14.2
  • Bleh. The [code](https://github.com/apache/incubator-mxnet/blob/1.1.0/python/mxnet/io.py#L509) slaps a blanket `except` around a line and assumes all exceptions inside the `try` are due to a bad input type. This would probably be easier to debug on Python 3, with exception chaining. – user2357112 Apr 12 '18 at 04:51
  • Try setting a pdb breakpoint inside `_init_data`. – user2357112 Apr 12 '18 at 04:52
  • You were right: using Python 3 with exception chaining was extremely helpful (code updated for Python 3 above). I ended up not needing `pdb` to find the error. The `TfidfTransformer` was returning a `scipy.sparse.csr_matrix` instead of the numpy.array I was anticipating ([documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer.fit_transform)). I will write an answer below indicating what changes I made to make the process work. – DFenstermacher Apr 12 '18 at 14:15
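
For reference, a minimal sketch of a post-mortem variant of the pdb suggestion in the comments above (the placement is illustrative; one could equally set a breakpoint inside `_init_data` itself):

import pdb

try:
    training_data = mx.io.NDArrayIter(data=preprocessed_data, label=labels, batch_size=50)
except TypeError:
    # Open the debugger on the raising frame; 'up' walks back into
    # _init_data, where 'p v' reveals the offending value's real type.
    pdb.post_mortem()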

1 Answer


With the help of user2357112, this was resolved by using exception chaining in Python 3 to find the underlying exception (the code in the question has been updated accordingly):

The transformer pipeline was returning a scipy.sparse.csr_matrix, so the list comprehension was producing an object-dtype numpy.array of sparse rows rather than a 2-d numpy.array. Changing the following line to densify the matrix with the toarray method makes the script run:

preprocessed_data = transformer.fit_transform(data).toarray()

A more memory-efficient solution: toarray is wasteful when used on a scipy.sparse.csr_matrix, since it densifies the entire matrix. In version 1.1.0 of mxnet, one can use mxnet.nd.sparse.array to store the data more efficiently:

...
preprocessed_data = mx.nd.sparse.array(transformer.fit_transform(data))

training_data = mx.io.NDArrayIter(data=preprocessed_data, label=labels, batch_size=5, last_batch_handle='discard')

for i, batch in enumerate(training_data):
    print(batch)

The only caveat is that one must pass the last_batch_handle='discard' keyword argument to NDArrayIter, which drops any final batch smaller than batch_size (the functionality of last_batch_handle is documented here).
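
As a quick sanity check (a sketch, assuming the iterator built above; exact shapes depend on the vocabulary size and label encoding), the geometry of each batch can be inspected:

training_data.reset()  # rewind the iterator before a second pass
for i, batch in enumerate(training_data):
    # batch.data and batch.label are lists of NDArrays; with
    # batch_size=5, the leading dimension of each should be 5
    print(i, batch.data[0].shape, batch.label[0].shape)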
