
According to this blog post by the Sparkling Water team, recent versions let you use Spark ML pipeline components to build a deep learning (DL) model. I tried adding the latest versions to my build.sbt:

"org.apache.spark" % "spark-mllib_2.10" % "2.0.0" % "provided",
"ai.h2o" % "sparkling-water-core_2.10" % "1.6.5" % "provided"

but no luck: importing org.apache.spark.ml.h2o.H2OPipeline fails. The h2o package under org.apache.spark.ml does not seem to exist in the Spark jars, even though it appears to work in the blog post above as well as here. I really want to reuse my spark-mllib feature transformers to build a DL model with H2O, as shown in the blog.

Any help appreciated!

Thanks.

void
  • Not sure if that's the problem, but you're using Spark 2.0 with Sparkling Water 1.6.5. You should use Sparkling Water 2.0, which was released recently. – Mateusz Dymczyk Oct 03 '16 at 19:22
  • I doubt that's the case. https://mvnrepository.com/artifact/ai.h2o/sparkling-water-core_2.10 only lists releases up to `1.6.8`. – void Oct 03 '16 at 19:43
  • Besides, we are talking about a package missing from `org.apache.spark.ml`, right? – void Oct 03 '16 at 19:46

2 Answers


1) Please don't use Spark 2 with Sparkling Water 1.6.5; it won't work. We released Sparkling Water 2.0 for Scala 2.11: https://mvnrepository.com/artifact/ai.h2o/sparkling-water-core_2.11

2) You're only adding Sparkling Water core to your build; the classes you are looking for are in sparkling-water-ml: https://mvnrepository.com/artifact/ai.h2o/sparkling-water-ml_2.11
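Putting the two points together, a build.sbt might look roughly like the sketch below. The exact version numbers (`2.11.8`, `2.0.0`) and the choice to leave the Sparkling Water artifacts out of the `provided` scope are assumptions for illustration; check the Maven repository pages linked above for the releases that match your Spark version.

```scala
// build.sbt (sketch): version numbers below are assumptions, not a tested setup
scalaVersion := "2.11.8" // Sparkling Water 2.0 is published for Scala 2.11

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, giving spark-mllib_2.11 etc.
  "org.apache.spark" %% "spark-mllib"          % "2.0.0" % "provided",
  "ai.h2o"           %% "sparkling-water-core" % "2.0.0",
  // sparkling-water-ml is the artifact that contains org.apache.spark.ml.h2o
  "ai.h2o"           %% "sparkling-water-ml"   % "2.0.0"
)
```

Using `%%` instead of hard-coding `_2.10` or `_2.11` in the artifact name keeps the Scala binary version of every dependency in line with `scalaVersion`.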

Mateusz Dymczyk
  • Thanks! It would be great if you could also point me to some documentation related to the ml pipelines support for sparkling water. – void Oct 05 '16 at 09:48
  • @void unfortunately we don't have much in terms of documentation as we're still implementing this and consider it an experimental feature. For now we only support H2O's GBM and DeepLearning as part of the pipelines (here's some example code https://github.com/h2oai/sparkling-water/blob/master/examples/pipelines/hamOrSpam.script.scala). We are very open to contributions :-) – Mateusz Dymczyk Oct 07 '16 at 04:55
  • Thanks a lot for the reply! Could you point me to any issues (JIRA) related to adding support for more algorithms in pipelines? I would like to keep track. – void Oct 08 '16 at 12:06

I used the versions below to run the H2O example with a Maven pom.xml, and it works:

  • Spark - 1.6
  • Sparkling Water - 1.6.8
  • H2O (ai.h2o) - 3.10.0.8

Here is the relevant part of the Maven pom.xml (the full file is in the Git repo: https://github.com/seerampavan/H2oTesting/blob/master/pom.xml)

<properties>
    <spark.version>1.6.0-cdh5.7.1</spark.version>
    <scala.version>2.10.4</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <top.dir>${project.basedir}/..</top.dir>
    <hadoop.version>2.6.0-cdh5.7.1</hadoop.version>
</properties>

<dependencies>
    <!-- Force import of Spark's servlet API for unit tests -->
    <dependency>
        <groupId>javax.servlet</groupId>
        <artifactId>javax.servlet-api</artifactId>
        <version>3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>

        <exclusions>
            <exclusion>
                <!-- make sure wrong scala version is not pulled in -->
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
            </exclusion>
            <exclusion>
                <!-- make sure wrong scala version is not pulled in -->
                <groupId>org.scala-lang</groupId>
                <artifactId>scalap</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>

    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>

    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>

        <exclusions>
            <exclusion>
                <groupId>org.jpmml</groupId>
                <artifactId>pmml-model</artifactId>
            </exclusion>
        </exclusions>

    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>

    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>

    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <type>test-jar</type>
        <classifier>tests</classifier>

    </dependency>
    <dependency>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest_${scala.binary.version}</artifactId>
        <version>2.2.1</version>

    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>

    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
        <exclusions>
            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>
            <exclusion>
                <groupId>javax.servlet</groupId>
                <artifactId>servlet-api</artifactId>
            </exclusion>
            <exclusion>
                <groupId>javax.servlet.jsp</groupId>
                <artifactId>jsp-api</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.jruby</groupId>
                <artifactId>jruby-complete</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.jboss.netty</groupId>
                <artifactId>netty</artifactId>
            </exclusion>
            <exclusion>
                <groupId>io.netty</groupId>
                <artifactId>netty</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-reflect</artifactId>
        <version>2.10.5</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-web</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-scala_2.10</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-persist-s3</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-persist-hdfs</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-parquet-parser</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-genmodel</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-core</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-bindings</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-avro-parser</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-app</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-algos</artifactId>
        <version>3.10.0.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>sparkling-water-repl_2.10</artifactId>
        <version>1.6.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>sparkling-water-ml_2.10</artifactId>
        <version>1.6.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>sparkling-water-examples_2.10</artifactId>
        <version>1.6.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>sparkling-water-core_2.10</artifactId>
        <version>1.6.8</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>deepwater-backend-api</artifactId>
        <version>1.0.0</version>
    </dependency>

    <dependency>
        <groupId>joda-time</groupId>
        <artifactId>joda-time</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.joda</groupId>
        <artifactId>joda-convert</artifactId>
        <version>1.8.1</version>
    </dependency>
    <dependency>
        <groupId>org.javassist</groupId>
        <artifactId>javassist</artifactId>
        <version>3.22.0-CR1</version>
    </dependency>
    <dependency>
        <groupId>gov.nist.math</groupId>
        <artifactId>jama</artifactId>
        <version>1.0.3</version>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.7</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>reflections</artifactId>
        <version>0.9.11-h2o-custom</version>
    </dependency>
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>google-analytics-java</artifactId>
        <version>1.1.2-H2O-CUSTOM</version>
    </dependency>
    <dependency>
        <groupId>com.github.tony19</groupId>
        <artifactId>named-regexp</artifactId>
        <version>0.2.4</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.11.45</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-kms</artifactId>
        <version>1.11.45</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-core</artifactId>
        <version>1.11.45</version>
    </dependency>
    <dependency>
        <groupId>org.eclipse.jetty.aggregate</groupId>
        <artifactId>jetty-servlet</artifactId>
        <version>8.2.0.v20160908</version>
    </dependency>
    <dependency>
        <groupId>org.eclipse.jetty.aggregate</groupId>
        <artifactId>jetty-server</artifactId>
        <version>8.2.0.v20160908</version>
    </dependency>
    <dependency>
        <groupId>org.eclipse.jetty.aggregate</groupId>
        <artifactId>jetty-plus</artifactId>
        <version>8.1.17.v20150415</version>
    </dependency>
</dependencies>
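For readers who build with sbt, as in the question, the Spark and Sparkling Water entries of the pom above would translate roughly as follows. This is only a sketch: it assumes the plain Apache `1.6.0` Spark release rather than the CDH build used in the pom, and it omits the individual H2O artifacts listed above.

```scala
// build.sbt (sketch) for the Spark 1.6 / Sparkling Water 1.6.8 combination
scalaVersion := "2.10.4" // same Scala version as in the pom properties

libraryDependencies ++= Seq(
  // Assumption: stock Apache Spark 1.6.0 instead of 1.6.0-cdh5.7.1
  "org.apache.spark" %% "spark-mllib"          % "1.6.0" % "provided",
  "ai.h2o"           %% "sparkling-water-core" % "1.6.8",
  "ai.h2o"           %% "sparkling-water-ml"   % "1.6.8"
)
```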
pavan