I'm following the problems from the book Implement 10 real-world deep learning applications using Deeplearning4j by Rezaul Karim. I have the ML background, but I'm learning the DL4J library. On the second problem, the cancer one, there is a piece of code that is driving me crazy because it raises an error ONLY when it is applied to one specific column. Here is the code:
Dataset<Row> trainingDF2 = trainingDF1.drop("PassengerId", "Name", "Ticket", "Cabin");

// Convert non-numeric columns to numeric form
StringIndexer sexIndexer = new StringIndexer()
        .setInputCol("Sex")
        .setOutputCol("sexIndex")
        .setHandleInvalid("skip"); // skip rows having nulls
StringIndexer embarkedIndexer = new StringIndexer()
        .setInputCol("Embarked")
        .setOutputCol("embarkedIndex")
        .setHandleInvalid("skip"); // skip rows having nulls

Dataset<Row> trainingDF21 = sexIndexer.fit(trainingDF2).transform(trainingDF2).drop("Sex");
Dataset<Row> trainingDF3 = embarkedIndexer.fit(trainingDF21).transform(trainingDF21).drop("Embarked");

// Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {sexIndexer, embarkedIndexer});
// Dataset<Row> trainingDF3 = pipeline.fit(trainingDF2).transform(trainingDF2).drop("Sex", "Embarked");

trainingDF1.show();
trainingDF3.show();
If you look at the example in the book, the author actually uses a Pipeline to process both columns, but I got the exception, so I decided to "separate" the columns to see what was going on and commented out the original code. I left it there as a reference.
As you can see, there are two StringIndexers, and right after that they are used to build new Datasets. The sexIndexer, the one that processes the "Sex" column of the original Dataset, works fine. It is the other one, the embarkedIndexer, the one that processes the "Embarked" column, that is driving me crazy by raising an InvocationTargetException. No matter what I try (changing the order in which the columns are processed, processing only that one column), nothing seems to avoid the exception. The only thing left to try is changing the column name. I haven't tried that because it's stupid, but at this point it looks like this is a "stupid solution" kind of problem.
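For anyone trying to reproduce this: a quick throwaway snippet like the one below (assuming the usual import static org.apache.spark.sql.functions.col) is how the contents of that column can be checked.

// Throwaway diagnostic: count the rows with a null "Embarked" value
long nullCount = trainingDF2.filter(col("Embarked").isNull()).count();
System.out.println("Rows with null Embarked: " + nullCount);

// Show the distinct values so any unexpected entries stand out
trainingDF2.select("Embarked").distinct().show();

That at least confirms the nulls are really in the data and not an artifact of the earlier drop calls.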
I hope I'm clear enough so someone can shed some light on this problem. Thanks.
Update: I made some progress by changing the null values of the problematic "Embarked" column to something not null, and that solved the exception. But there is still a problem with .setHandleInvalid("skip") on the indexer, which is supposed to take care of the null values by skipping them, I presume...
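Concretely, the change that made the exception go away looks roughly like this (a minimal sketch; "S" is just a placeholder value I picked, not anything from the book):

// Replace nulls in "Embarked" with a placeholder before indexing;
// DataFrameNaFunctions.fill only touches the listed columns
Dataset<Row> trainingDF2Filled = trainingDF2.na().fill("S", new String[] {"Embarked"});

// The behavior I expected setHandleInvalid("skip") to give me,
// done by hand: drop the rows where "Embarked" is null
Dataset<Row> trainingDF2NoNulls = trainingDF2.filter(col("Embarked").isNotNull());

The fill variant is what I actually used; the explicit filter is just the equivalent of what I assumed "skip" would do for me.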
I'm adding the full exception to this post:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:63)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:219)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:207)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$2.apply(Metadata.scala:207)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:207)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Map$Map3.foreach(Map.scala:161)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
at scala.util.hashing.MurmurHash3.unorderedHash(MurmurHash3.scala:91)
at scala.util.hashing.MurmurHash3$.mapHash(MurmurHash3.scala:222)
at scala.collection.GenMapLike$class.hashCode(GenMapLike.scala:35)
at scala.collection.AbstractMap.hashCode(Map.scala:59)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:206)
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata$$anonfun$org$apache$spark$sql$types$Metadata$$hash$1.apply(Metadata.scala:204)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
at scala.util.hashing.MurmurHash3.unorderedHash(MurmurHash3.scala:91)
at scala.util.hashing.MurmurHash3$.mapHash(MurmurHash3.scala:222)
at scala.collection.GenMapLike$class.hashCode(GenMapLike.scala:35)
at scala.collection.AbstractMap.hashCode(Map.scala:59)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:206)
at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$hash(Metadata.scala:204)
at org.apache.spark.sql.types.Metadata._hashCode$lzycompute(Metadata.scala:107)
at org.apache.spark.sql.types.Metadata._hashCode(Metadata.scala:107)
at org.apache.spark.sql.types.Metadata.hashCode(Metadata.scala:108)
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:249)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:206)
at scala.collection.immutable.HashSet.elemHashCode(HashSet.scala:177)
at scala.collection.immutable.HashSet.computeHash(HashSet.scala:186)
at scala.collection.immutable.HashSet.$plus(HashSet.scala:84)
at scala.collection.immutable.HashSet.$plus(HashSet.scala:35)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:359)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:295)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:267)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$$anonfun$apply$32.applyOrElse(Analyzer.scala:2027)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$$anonfun$apply$32.applyOrElse(Analyzer.scala:2023)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$.apply(Analyzer.scala:2023)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$.apply(Analyzer.scala:2022)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:258)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:209)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.ml.feature.StringIndexerModel.transform(StringIndexer.scala:185)
at Test.run(Test.java:63)
at Main.main(Main.java:6)
... 5 more