
I tested anomaly detection using Deeplearning4j. Everything works fine except that I am not able to preserve the VehicleID while training. What is the best approach in such a scenario?

Please look at the following snippet of code. SparkTransformExecutor returns an RDD, and InMemorySequenceRecordReader takes a list; when I collect that list from the RDD, the indexing is not guaranteed.

```scala
val records: JavaRDD[util.List[util.List[Writable]]] = SparkTransformExecutor
  .executeToSequence(.....)
val split = records.randomSplit(Array[Double](0.7, 0.3))
val testSequences = split(1)

// in-memory sequence reader
val testRR = new InMemorySequenceRecordReader(testSequences.collect().toList)

val testIter = new RecordReaderMultiDataSetIterator.Builder(batchSize)
  .addSequenceReader("records", testRR)
  .addInput("records")
  .build()
```
Harvinder Singh
  • I resolved this issue by writing a CustomSequenceRecordReader and a CustomMetaData; now one of the columns in my input data is treated as metadata – Harvinder Singh Aug 07 '18 at 12:55

1 Answer


Typically you track training examples by index in a dataset. Track which index in the dataset each vehicle sits at alongside training. There are a number of ways to do that.
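For instance, a minimal sketch of that bookkeeping (the class and method names here are illustrative assumptions, not part of dl4j):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Keep an example-index -> vehicle ID map alongside the dataset, so the ID
// survives training even though the network itself never sees it.
public class VehicleIndex {
    public static Map<Integer, String> buildIndex(List<String> vehicleIds) {
        Map<Integer, String> indexToVehicle = new HashMap<>();
        for (int i = 0; i < vehicleIds.size(); i++) {
            indexToVehicle.put(i, vehicleIds.get(i)); // example i belongs to this vehicle
        }
        return indexToVehicle;
    }
}
```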

In dl4j, we typically keep the data raw and use record readers + transform processes for the training data. If you use a record reader on raw data (pick one for your dataset; it could be CSV or even video) and use a RecordReaderDataSetIterator like here:

```java
RecordReader recordReader = new CSVRecordReader(0, ',');
recordReader.initialize(new FileSplit(new ClassPathResource("iris.txt").getFile()));
int labelIndex = 4;
int numClasses = 3;
int batchSize = 150;

RecordReaderDataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, batchSize, labelIndex, numClasses);
iterator.setCollectMetaData(true);  //Instruct the iterator to collect metadata, and store it in the DataSet objects
DataSet allData = iterator.next();

SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65);  //split line restored from the linked complete example
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();
```

Complete code here: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/dataexamples/CSVExampleEvaluationMetaData.java
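With metadata collection enabled, you can map any example back to its source row. A hedged sketch following the linked example (reusing `trainingData` and `recordReader` from the block above):

```java
import org.datavec.api.records.Record;
import org.datavec.api.records.metadata.RecordMetaData;
import java.util.List;

// Each DataSet now carries RecordMetaData pointing back at the original rows,
// so an ID column (like VehicleID) never has to be fed to the network.
List<RecordMetaData> trainMetaData = trainingData.getExampleMetaData(RecordMetaData.class);
for (RecordMetaData meta : trainMetaData) {
    System.out.println(meta.getLocation()); // e.g. the line in the source CSV
}
// Reload the raw records (including any ID columns) for these examples:
List<Record> originalRecords = recordReader.loadFromMetaData(trainMetaData);
```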

Alongside this you use TransformProcess:

```java
//Let's define the schema of the data that we want to import
//The order in which columns are defined here should match the
//order in which they appear in the input data
Schema inputDataSchema = new Schema.Builder()
    //We can define a single column
    .addColumnString("DateTimeString")
    //....
    .build();

//At each step, we identify columns by the name we gave them in the
//input data schema, above
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    //your transforms go here
    .build();
```

Complete example below:

https://github.com/deeplearning4j/dl4j-examples/blob/6967b2ec2d51b0d19b5d6437763a2936ca922a0a/datavec-examples/src/main/java/org/datavec/transform/basic/BasicDataVecExampleLocal.java
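As a hedged illustration of how the vehicle ID fits into such a pipeline (the column names below are assumptions, not from the original post): keep VehicleID in the raw data and schema, and strip it only from the features fed to the network.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

// Assumed columns for illustration; the raw data keeps VehicleID throughout.
Schema schema = new Schema.Builder()
    .addColumnString("VehicleID")
    .addColumnString("DateTimeString")
    .addColumnDouble("sensorReading")
    .build();

TransformProcess tp = new TransformProcess.Builder(schema)
    .removeColumns("VehicleID") // drop the ID from the training features only
    .build();
```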

If you use these tools, you can keep the data as is but still have a complete data pipeline. There are a lot of ways to do it; just keep in mind that you start with the vehicle ID, so it doesn't have to disappear.

Adam Gibson
  • An index-based approach should be avoided when working on a distributed dataset using Apache Spark. It is still not clear from your examples how VehicleID can be propagated when data is distributed across partitions. It seems I am missing the obvious here. – Harvinder Singh Aug 03 '18 at 13:33
  • I'm not sure how Spark or distributed is relevant here. What was discussed here was local datasets. If you are going to do Spark, you would still do indexed batches. Not all datasets are CSVs. Many are binary tensors. Those are usually pre-created beforehand. Spark can't handle these kinds of datasets. – Adam Gibson Aug 04 '18 at 02:52
  • Hi Adam, I have pasted a snippet of code in the question section for your reference. – Harvinder Singh Aug 06 '18 at 08:29
  • I can selectively specify my data columns in RecordReaderMultiDataSetIterator, but I could not find a way to specify MetaData in it. Can I set the metadata in a preProcessor, or is there some other, simpler way? – Harvinder Singh Aug 06 '18 at 13:45
  • I solved my problem by writing a custom SequenceRecordReader and a custom MetaData class; now I can set the required attribute as metadata (see the sketch below). – Harvinder Singh Aug 07 '18 at 12:53
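A minimal sketch of what such a custom MetaData class might look like, assuming the DataVec RecordMetaData interface; the class name and fields are hypothetical, not the poster's actual code:

```java
import org.datavec.api.records.metadata.RecordMetaData;
import org.datavec.api.records.reader.RecordReader;
import java.net.URI;

// Hypothetical metadata class carrying the vehicle ID for each sequence.
public class VehicleMetaData implements RecordMetaData {
    private final String vehicleId;
    private final URI uri;

    public VehicleMetaData(String vehicleId, URI uri) {
        this.vehicleId = vehicleId;
        this.uri = uri;
    }

    public String getVehicleId() {
        return vehicleId;
    }

    @Override
    public String getLocation() {
        return "vehicle " + vehicleId; // human-readable pointer back to the source
    }

    @Override
    public URI getURI() {
        return uri;
    }

    @Override
    public Class<? extends RecordReader> getReaderClass() {
        return null; // or the custom SequenceRecordReader class, if tracked
    }
}
```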