I'm following the problems from the book Implement 10 real-world deep learning applications using Deeplearning4j by Rezaul Karim. I have the ML background, but I'm learning the DL4J library. On the second problem, the cancer one, there is a piece of code that is driving me crazy because it raises an error ONLY when it is applied to one specific column. Here is the code:
Dataset<Row> trainingDF2 = trainingDF1.drop("PassengerId", "Name", "Ticket", "Cabin");

// Convert non-numeric columns to numeric form
StringIndexer sexIndexer = new StringIndexer()
        .setInputCol("Sex")
        .setOutputCol("sexIndex")
        .setHandleInvalid("skip"); // skip rows having nulls
StringIndexer embarkedIndexer = new StringIndexer()
        .setInputCol("Embarked")
        .setOutputCol("embarkedIndex")
        .setHandleInvalid("skip"); // skip rows having nulls

Dataset<Row> trainingDF21 = sexIndexer.fit(trainingDF2).transform(trainingDF2).drop("Sex");
Dataset<Row> trainingDF3 = embarkedIndexer.fit(trainingDF21).transform(trainingDF21).drop("Embarked");

// Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {sexIndexer, embarkedIndexer});
// Dataset<Row> trainingDF3 = pipeline.fit(trainingDF2).transform(trainingDF2).drop("Sex", "Embarked");

trainingDF1.show();
trainingDF3.show();
If you look at the example in the book, the author actually uses a Pipeline to process both columns, but I got the exception, so I decided to "separate" the columns to see what was going on and commented out the original code. I left it there as a reference.
As you can see, there are two StringIndexers, and right after that they are used to build new Datasets. The sexIndexer, the one that processes the "Sex" column of the original Dataset, works fine. It is the other one, the embarkedIndexer, the one that processes the "Embarked" column, that is driving me crazy by raising an InvocationTargetException. No matter what I try (changing the order in which the columns are processed, processing only that one column), nothing seems to avoid the exception. The only thing left to try is changing the column name. I haven't tried that because it's stupid, but at this point it looks like this is a "stupid solution" kind of problem.
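For anyone trying to reproduce this: a quick throwaway snippet like the one below (assuming the usual import static org.apache.spark.sql.functions.col) is how the contents of that column can be checked.

// Throwaway diagnostic: count the rows with a null "Embarked" value
long nullCount = trainingDF2.filter(col("Embarked").isNull()).count();
System.out.println("Rows with null Embarked: " + nullCount);

// Show the distinct values so any unexpected entries stand out
trainingDF2.select("Embarked").distinct().show();

That at least confirms the nulls are really in the data and not an artifact of the earlier drop calls.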
I hope I'm clear enough so someone can shed some light on this problem. Thanks.
Update: I made some progress by changing the null values of the problematic "Embarked" column to something not null, and that solved the exception. But there is still a problem with .setHandleInvalid("skip") on the indexer, which is supposed to take care of the null values by skipping them, I presume...
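Concretely, the change that made the exception go away looks roughly like this (a minimal sketch; "S" is just a placeholder value I picked, not anything from the book):

// Replace nulls in "Embarked" with a placeholder before indexing;
// DataFrameNaFunctions.fill only touches the listed columns
Dataset<Row> trainingDF2Filled = trainingDF2.na().fill("S", new String[] {"Embarked"});

// The behavior I expected setHandleInvalid("skip") to give me,
// done by hand: drop the rows where "Embarked" is null
Dataset<Row> trainingDF2NoNulls = trainingDF2.filter(col("Embarked").isNotNull());

The fill variant is what I actually used; the explicit filter is just the equivalent of what I assumed "skip" would do for me.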
I'm adding the full exception to this post:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:63)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:219)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:207)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:207)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:207)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Map$Map3.foreach(Map.scala:161)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
at scala.util.hashing.MurmurHash3.unorderedHash(MurmurHash3.scala:91)
at scala.util.hashing.MurmurHash3$.mapHash(MurmurHash3.scala:222)
at scala.collection.GenMapLike$class.hashCode(GenMapLike.scala:35)
at scala.collection.AbstractMap.hashCode(Map.scala:59)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:206)
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
at scala.util.hashing.MurmurHash3.unorderedHash(MurmurHash3.scala:91)
at scala.util.hashing.MurmurHash3$.mapHash(MurmurHash3.scala:222)
at scala.collection.GenMapLike$class.hashCode(GenMapLike.scala:35)
at scala.collection.AbstractMap.hashCode(Map.scala:59)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:206)
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata._hashCode$lzycompute(Metadata.scala:107)
at org.apache.spark.sql.types.Metadata._hashCode(Metadata.scala:107)
at org.apache.spark.sql.types.Metadata.hashCode(Metadata.scala:108)
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:249)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:206)
at scala.collection.immutable.HashSet.elemHashCode(HashSet.scala:177)
at scala.collection.immutable.HashSet.computeHash(HashSet.scala:186)
at scala.collection.immutable.HashSet.$plus(HashSet.scala:84)
at scala.collection.immutable.HashSet.$plus(HashSet.scala:35)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:359)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:295)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:267)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$$anonfun$apply$32.applyOrElse(Analyzer.scala:2027)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$$anonfun$apply$32.applyOrElse(Analyzer.scala:2023)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$.apply(Analyzer.scala:2023)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$.apply(Analyzer.scala:2022)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:258)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:209)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.ml.feature.StringIndexerModel.transform(StringIndexer.scala:185)
at Test.run(Test.java:63)
at Main.main(Main.java:6)
... 5 more