Suppose I have a data file whose entries look like this:

0.00,2015-10-21,1,Y,798.78,323793701,6684,0.00,Q,H2512,PE0,1,0000

I would like to use this as input to an mxnet model (a basic feed-forward multi-layer perceptron). A single input record has the following data types, in the order shown above:

float,date,int,categorical,float,int,int,float,categorical,categorical,categorical,int,float

Each record is a meaningful representation of a specific entity. How do I represent this sort of data to mxnet? To complicate things a bit, suppose I also want to one-hot encode the categorical columns. And what if each record has these fields, in the order shown, but repeated multiple times in some cases, so that records may have different lengths?

The docs are great for the basic cases, where the input data is all of the same type and can be loaded into a single input without any transformation. But how do I handle this case?

Update: adding some additional details. To keep this as simple as possible, let's say I just want to feed this into a simple network, something like:

my $data = mx->symbol->Variable("data");
my $fc = mx->symbol->FullyConnected($data, num_hidden => 1);
my $softmax=mx->symbol->SoftmaxOutput(data => $fc, name => "softmax");
my $module = mx->mod->new(symbol => $softmax);

In the simple case, where the data is all one type and needs little pre-processing, I could then just do something along the lines of:

$module->fit(
    $train_iter,
    eval_data => $eval_iter,
    optimizer => "adam",
    optimizer_params=>{learning_rate=>0.001},
    eval_metric => "mse",
    num_epoch => 25
);

where $train_iter is a simple NDArray iterator over the training data. (Well, with the Perl API it's not exactly an NDArray, but it has complete parity with that interface, so it is conceptually identical.)

sail0r
  • One approach is to define a Variable for each column. But please clarify first what network structure you have in mind. You either need a network structure that merges the different Variables in some layer, or you would need to take care of the data merging upfront, i.e. creating a single input vector. – leezu Aug 10 '17 at 22:21
  • @leezu I edited my original post with some details. Is that what you meant by "network structure"? I am just getting started and really am keeping it this simple for now. – sail0r Aug 11 '17 at 16:54

1 Answer

NDArrayIter also supports multiple inputs. You can use it as follows:

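import mxnet as mx
import numpy as np
# Named inputs and labels are passed as dicts; batch size 3, shuffled, incomplete last batch dropped
data = {'data1':np.zeros(shape=(10,2,2)), 'data2':np.zeros(shape=(20,2,2))}
label = {'label1':np.zeros(shape=(10,1)), 'label2':np.zeros(shape=(20,1))}
dataiter = mx.io.NDArrayIter(data, label, batch_size=3, shuffle=True, last_batch_handle='discard')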

Before that, you will have to convert your non-numeric data into numerical data. This could be a one-hot vector or some other encoding, depending on the meaning of the variable.
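As a rough sketch of that conversion step, the snippet below turns records like the one in the question into flat lists of floats using only the Python standard library. Several things here are assumptions rather than anything mxnet requires: the names `COLUMN_KINDS`, `build_vocabs` and `encode_row` are hypothetical helpers, the date is encoded as an ordinal day number (one of many possible choices), and the second sample line is made up purely so that the categorical columns have more than one value.

```python
import csv
from datetime import datetime

# Column kinds in the order given in the question ("cat" = categorical).
COLUMN_KINDS = ["float", "date", "int", "cat", "float", "int", "int",
                "float", "cat", "cat", "cat", "int", "float"]

def build_vocabs(rows):
    """Collect the distinct values seen in each categorical column."""
    vocabs = {}
    for row in rows:
        for i, kind in enumerate(COLUMN_KINDS):
            if kind == "cat":
                vocabs.setdefault(i, set()).add(row[i])
    # Sort so that every value gets a stable index.
    return {i: sorted(values) for i, values in vocabs.items()}

def encode_row(row, vocabs):
    """Turn one parsed record into a flat list of floats."""
    features = []
    for i, (kind, value) in enumerate(zip(COLUMN_KINDS, row)):
        if kind == "cat":
            one_hot = [0.0] * len(vocabs[i])
            one_hot[vocabs[i].index(value)] = 1.0
            features.extend(one_hot)
        elif kind == "date":
            # Assumption: represent the date as an ordinal day number.
            day = datetime.strptime(value, "%Y-%m-%d").date().toordinal()
            features.append(float(day))
        else:
            features.append(float(value))
    return features

lines = [
    "0.00,2015-10-21,1,Y,798.78,323793701,6684,0.00,Q,H2512,PE0,1,0000",
    # Hypothetical second record, invented for illustration only.
    "1.50,2015-10-22,2,N,12.00,323793702,6685,3.25,Q,H2513,PE0,0,0001",
]
rows = list(csv.reader(lines))
vocabs = build_vocabs(rows)
encoded = [encode_row(row, vocabs) for row in rows]
```

The resulting lists all have the same length and can be converted to a numpy array and handed to NDArrayIter as a single input. A value that was not seen when the vocabularies were built raises a ValueError in this sketch, and the large integer columns would probably need scaling before training.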

As for samples having different lengths, the easiest way is to bring them all to a common length by padding the shorter ones with 0s.
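A minimal sketch of that padding idea, in plain Python: `pad_records` is a hypothetical helper name, and the sample values are placeholders rather than real data. It assumes each record has already been converted to a list of floats.

```python
def pad_records(records, pad_value=0.0):
    """Right-pad every feature list to the length of the longest one."""
    max_len = max(len(r) for r in records)
    return [r + [pad_value] * (max_len - len(r)) for r in records]

# Placeholder records: the second one repeats its fields twice.
records = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
]
padded = pad_records(records)
```

After padding, every record has the same length, so the whole set can be stacked into one 2-D array. In practice you would likely fix the target length up front (for example, from the training set) so that evaluation data is padded to the same width.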

rahul003