
Is there any convenient way to convert a DataFrame from Spark to the type used by DL4J? Currently, when using a DataFrame in algorithms with DL4J, I get an error: "type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".

atos
  • Don't have the experience with dl4j to write an answer, but does this help? https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-spark-examples/dl4j-spark/src/main/java/org/deeplearning4j/mlp/MnistMLPExample.java – Ethan Sep 18 '18 at 15:25
  • Not exactly. It does not use Dataframe from Spark, instead MnistDataSetIterator is used. Generally, I found some examples of how you construct DataSet, but I do not know if this is enough. I thought that maybe there is some implementation in the already existing API, which I do not see. – atos Sep 18 '18 at 17:12
  • can you try to parallelize your Dataframe with `sparkContext.parallelize(yourDataFrame)`? This should create an `RDD[DataSet]`. `sparkContext` is part of `SparkSession` in 2.x and `sc` in 1.x – emran Sep 18 '18 at 18:49
  • As far as I know `Dataframe` does not need to be parallelized because it is a distributed data type (I even receive a warning about incompatible types). In addition, `Dataset` is the type from Spark, and `DataSet` is from `org.nd4j.linalg.dataset.DataSet` – atos Sep 18 '18 at 19:25
  • Hey folks, the comments here using data set iterators are wrong. Please do not use that with spark. You need to look a bit beyond the hello world in the examples if you are going to be using dl4j with columnar data. A better example to *actually* look at is the data vec examples: https://github.com/deeplearning4j/dl4j-examples/blob/master/datavec-examples/src/main/java/org/datavec/transform/logdata/LogDataExample.java – Adam Gibson Sep 19 '18 at 02:22

1 Answer


In general, we use DataVec for that. I can point you at examples for that if you want. DataFrames make too many assumptions, which makes them too brittle to be used for real-world deep learning.

Beyond that, a data frame is typically not a good abstraction for representing linear algebra. (It falls down when dealing with images, for example.)

We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java

But in general, a `DataSet` is just a pair of ndarrays, just like in numpy. If you have to use Spark tools and only want to use ndarrays for the last mile, then my advice would be to get the DataFrame to match some form of schema that is purely numerical, and map each row of that to an ndarray "row".
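To make the "purely numerical schema" idea concrete, here is a minimal Scala sketch that flattens a Spark `Row` into a plain float array (the helper name and the example column names are my assumptions, not part of the original answer):

```scala
import org.apache.spark.sql.Row

// Hypothetical helper: assumes every column in the Row is numeric.
// Each Row becomes the flat float array that will back one ndarray "row".
def rowToFloats(row: Row): Array[Float] =
  (0 until row.length).map(i => row.getAs[Number](i).floatValue()).toArray

// Usage sketch, assuming `df` has been reduced to numeric feature columns:
// val featureRdd = df.select("f1", "f2", "f3").rdd.map(rowToFloats)
```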

In general, a big reason we do this is that all of our ndarrays are off heap. Spark has many limitations when it comes to working with its data pipelines, and it uses the JVM for things it shouldn't be used for (matrix math) - we took a different approach that allows us to use GPUs and a bunch of other things efficiently.

When we do that conversion, it ends up being: raw data -> numerical representation -> ndarray

What you could do is map the DataFrame's rows onto double/float arrays and then use `Nd4j.create(float/doubleArray)`, or you could also do: `someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(yourInputArray), yourLabelINDArray))`

That will give you a "dataset". You need a pair of ndarrays matching your input data and a label. What the label looks like depends on the kind of problem you're solving, whether that's classification or regression.
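Putting those steps together, a Scala sketch of the DataFrame -> `RDD[DataSet]` conversion might look like the following. This is an illustration under assumptions: the feature/label column names are hypothetical, and a single numeric label column is assumed (a classification problem would instead need a one-hot label ndarray):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, functions => F}
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

// Sketch only: map a numeric-only DataFrame to the RDD[DataSet] that
// DL4J's Spark training expects. Column names are illustrative assumptions.
def toDataSetRdd(df: DataFrame,
                 featureCols: Seq[String],
                 labelCol: String): RDD[DataSet] =
  df.select((featureCols :+ labelCol).map(F.col): _*).rdd.map { row =>
    val features =
      featureCols.map(c => row.getAs[Number](row.fieldIndex(c)).doubleValue()).toArray
    val label =
      Array(row.getAs[Number](row.fieldIndex(labelCol)).doubleValue())
    // A DataSet is just the pair (features ndarray, label ndarray).
    new DataSet(Nd4j.create(features), Nd4j.create(label))
  }
```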

Adam Gibson