
I want to run a pipeline with the Spark runner; the data is stored on a remote machine. The following command was used to submit the job:

./spark-submit   --class org.apache.beam.examples.WordCount   --master spark://192.168.1.214:6066   --deploy-mode cluster   --supervise   --executor-memory 2G   --total-executor-cores 4 hdfs://192.168.1.214:9000/input/word-count-ck-0.1.jar --runner=SparkRunner

It produces the following response:

Running Spark using the REST application submission protocol.
        Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
        17/06/12 14:44:49 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.1.214:6066.
        17/06/12 14:44:49 INFO RestSubmissionClient: Submission successfully created as driver-20170612200920-0006. Polling submission state...
        17/06/12 14:44:49 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20170612200920-0006 in spark://192.168.1.214:6066.
        17/06/12 14:44:49 INFO RestSubmissionClient: State of driver driver-20170612200920-0006 is now RUNNING.
        17/06/12 14:44:49 INFO RestSubmissionClient: Driver is running on worker worker-20170612193258-192.168.1.214-37336 at 192.168.1.214:37336.
        17/06/12 14:44:49 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
        {
          "action" : "CreateSubmissionResponse",
          "message" : "Driver successfully submitted as driver-20170612200920-0006",
          "serverSparkVersion" : "1.6.3",
          "submissionId" : "driver-20170612200920-0006",
          "success" : true
        }

However, the job is stuck in the 'RUNNING' state, and stderr shows the following exception along with other details:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
        at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
    Caused by: java.lang.IllegalStateException: Unable to find registrar for hdfs
        at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
        at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:523)
        at org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:204)
        at org.apache.beam.sdk.io.TextIO$Write.to(TextIO.java:294)
        at org.apache.beam.examples.WordCount.main(WordCount.java:132)
        ... 6 more

The following are the plugins and dependencies I used in my project:

    <packaging>jar</packaging>

    <properties>
        <beam.version>2.0.0</beam.version>
        <surefire-plugin.version>2.20</surefire-plugin.version>
    </properties>

    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-spark</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.6.3</version>
            <scope>runtime</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>jul-to-slf4j</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-flink_2.10</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.module</groupId>
            <artifactId>jackson-module-scala_2.10</artifactId>
            <version>2.8.8</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>${beam.version}</version>
    <!--         <exclusions>
            <exclusion>
            <artifactId>beam-sdks-java-core</artifactId>
            </exclusion>
            </exclusions> -->
        </dependency>

        <!-- Adds a dependency on the Beam Google Cloud Platform IO module. -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
            <version>${beam.version}</version>
        </dependency>

        <!-- Dependencies below this line are specific dependencies needed by the examples code. -->
        <dependency>
            <groupId>com.google.api-client</groupId>
            <artifactId>google-api-client</artifactId>
            <version>1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-bigquery</artifactId>
            <version>v2-rev295-1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.http-client</groupId>
            <artifactId>google-http-client</artifactId>
            <version>1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-pubsub</artifactId>
            <version>v1-rev10-1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.4</version>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>20.0</version>
        </dependency>

        <!-- Add slf4j API frontend binding with JUL backend -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.14</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
            <version>1.7.14</version>
            <!-- When loaded at runtime this will wire up slf4j to the JUL backend -->
            <scope>runtime</scope>
        </dependency>

        <!-- Hamcrest and JUnit are required dependencies of PAssert,
             which is used in the main code of DebuggingWordCount example. -->
        <dependency>
            <groupId>org.hamcrest</groupId>
            <artifactId>hamcrest-all</artifactId>
            <version>1.3</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-common</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-input-format</artifactId>
            <version>${beam.version}</version>
        </dependency>

        <!-- The DirectRunner is needed for unit tests. -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
            <version>${beam.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.0.0-alpha2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>${surefire-plugin.version}</version>
                <configuration>
                    <parallel>all</parallel>
                    <threadCount>4</threadCount>
                    <redirectTestOutputToFile>true</redirectTestOutputToFile>
                </configuration>
                <dependencies>
                    <dependency>
                        <groupId>org.apache.maven.surefire</groupId>
                        <artifactId>surefire-junit47</artifactId>
                        <version>${surefire-plugin.version}</version>
                    </dependency>
                </dependencies>
            </plugin>

            <!-- Ensure that the Maven jar plugin runs before the Maven
              shade plugin by listing the plugin higher within the file. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
            </plugin>



            <!--
              Configures `mvn package` to produce a bundled jar ("fat jar") for runners
              that require this for job submission to a cluster.
            -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/LICENSE</exclude>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.codehaus.mojo</groupId>
                    <artifactId>exec-maven-plugin</artifactId>
                    <version>1.4.0</version>
                    <configuration>
                        <cleanupDaemonThreads>false</cleanupDaemonThreads>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
    </project>

The fat jar contains HadoopFileSystemRegistrar. The following is the source code of the WordCount class:

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.beam.examples;

import java.util.Collections;

import org.apache.beam.examples.common.ExampleUtils;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
//import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;

/**
 * An example that counts words in Shakespeare and includes Beam best practices.
 */
public class WordCount {
    static class ExtractWordsFn extends DoFn<String, String> {
        private final Counter emptyLines = Metrics
                .counter(ExtractWordsFn.class, "emptyLines");

        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.inc();
            }

            // Split the line into words.
            String[] words = c.element().split(ExampleUtils.TOKENIZER_PATTERN);

            // Output each word encountered into the output PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    /**
     * A SimpleFunction that converts a Word and Count into a printable string.
     */
    public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
        @Override
        public String apply(KV<String, Long> input) {
            return input.getKey() + ": " + input.getValue();
        }
    }

    public static class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> expand(PCollection<String> lines) {

            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

            // Count the number of times each word occurs.
            PCollection<KV<String, Long>> wordCounts = words
                    .apply(Count.<String>perElement());

            return wordCounts;
        }
    }

    /**
     * Options supported by {@link WordCount}. Concept #4: Defining your own
     * configuration options. Here, you can add your own arguments to be
     * processed by the command-line parser, and specify default values for
     * them. You can then access the options values in your pipeline code.
     * Inherits standard configuration options.
     */
    public interface WordCountOptions extends HadoopFileSystemOptions {

        /**
         * By default, this example reads from a public dataset containing the
         * text of King Lear. Set this option to choose a different input file
         * or glob.
         */
        @Description("Path of the file to read from")
        @Default.String("hdfs://192.168.1.214:9000/beamWorks/kinglear.txt")
        String getInputFile();

        void setInputFile(String value);

        /**
         * Set this required option to specify where to write the output.
         */
        @Description("/home/ankit/kinglear_chandan.txt ")
        @Default.String("hdfs://192.168.1.214:9000/beamWorks/ckoutput/ck")
        @Required
        String getOutput();

        void setOutput(String value);
    }

    public static void main(String[] args) {
        String[] args1 = new String[]{
                "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://192.168.1.214:9000\"}]",
                "--runner=SparkRunner"};
        WordCountOptions options = PipelineOptionsFactory
                .fromArgs(args1)
                .withValidation()
                .as(WordCountOptions.class);
        Pipeline p = Pipeline.create(options);
        p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
                .apply(new CountWords())
                .apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.write().to(options.getOutput()));
        p.run().waitUntilFinish();
    }
}
Chandan Kumar

4 Answers


If you have set up fs.defaultFS as described/linked in the accepted answer, but are still encountering the issue (which was the case for me), the root cause may be different.

Namely, it could be that the Java ServiceLoader is not able to find the HadoopFileSystemRegistrar. In that case, you may have to modify the way your executable jar is assembled. This answer to a similar question (gs instead of hdfs) provided the solution for me.
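A quick way to check this (a diagnostic sketch of my own, not part of the original answer) is to list which registrars the ServiceLoader actually sees on the classpath of your bundled jar. Beam discovers file systems through the FileSystemRegistrar SPI, so if the META-INF/services entries were not merged into the fat jar (which is what the shade plugin's ServicesResourceTransformer takes care of), the HDFS registrar will simply not show up here even though the class itself is present in the jar:

    import java.util.ServiceLoader;

    import org.apache.beam.sdk.io.FileSystemRegistrar;

    // Hypothetical helper class, for illustration only.
    public class RegistrarDiagnostic {
        public static void main(String[] args) {
            // Beam's FileSystems loads registrars via ServiceLoader, which reads
            // META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar.
            // Print every registrar visible on the current classpath.
            for (FileSystemRegistrar registrar : ServiceLoader.load(FileSystemRegistrar.class)) {
                System.out.println(registrar.getClass().getName());
            }
        }
    }

If the jar is assembled correctly, you would expect to see org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar in the output alongside the local file system registrar.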

Edit: I'm not using Spark but Flink, and I run the pipeline's main class directly with 'yarn jar'.

Ivan Plantevin

I had the same issue. Please take a look at this Jira ticket, https://issues.apache.org/jira/projects/BEAM/issues/BEAM-2429, and set the parameter fs.defaultFS so that HDFS paths can be handled. I hope this helps you.
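For reference, here is a minimal sketch of setting that parameter programmatically through HadoopFileSystemOptions (the NameNode address hdfs://192.168.1.214:9000 is taken from the question and is an assumption about your cluster; adjust it to your setup):

    import java.util.Collections;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    // Hypothetical class name, for illustration only.
    public class HdfsOptionsSketch {
        public static void main(String[] args) {
            HadoopFileSystemOptions options = PipelineOptionsFactory
                    .fromArgs(args)
                    .withValidation()
                    .as(HadoopFileSystemOptions.class);

            // Point Beam's Hadoop file system at the remote NameNode so that
            // hdfs:// paths can be resolved.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://192.168.1.214:9000");
            options.setHdfsConfiguration(Collections.singletonList(conf));

            Pipeline p = Pipeline.create(options);
            // ... apply the read/count/write transforms and run the pipeline as usual
        }
    }

The same configuration can also be passed on the command line via --hdfsConfiguration, as shown in the question's main method.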

François
  • Hi, thank you for the comment. I tried the options suggested above (I am updating the source code of WordCount.java accordingly so you can have a look). However, it did not have any effect; the error is still 'unable to find registrar for hdfs'. – Chandan Kumar Jun 20 '17 at 09:50
  • The error turns into 'java.lang.IllegalStateException: Scheme: [file] has conflicting filesystems: [org.apache.beam.sdk.io.LocalFileSystem, org.apache.beam.sdk.io.hdfs.HadoopFileSystem]' if I remove the second filter from the maven-shade-plugin and build the fat jar. This is the only difference I have observed. – Chandan Kumar Jun 20 '17 at 10:23
  • `String[] args = new String[]{ "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://host:port\"}]"}; options = PipelineOptionsFactory .fromArgs(args) .withValidation() .as(HadoopFileSystemOptions.class); Pipeline pipeline = Pipeline.create(options);` – François Jun 20 '17 at 14:02
  • My mistake, I was not supplying the host:port value for the parameter 'fs.defaultFS'. However, the error has now changed to 'java.lang.RuntimeException: java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy'. I supplied versions 1.6.0/1.6.2/1.6.3 for artifactId spark-streaming_2.10, but it does not help. – Chandan Kumar Jun 21 '17 at 10:16
  • Thanks a lot. It finally worked. I have updated the source code and pom.xml to reflect what worked for me. It worked with Apache Spark version 1.6.3, though. Thanks to all of you for putting in the effort. Cheers :) – Chandan Kumar Jun 21 '17 at 13:19

I don't see a dependency on beam-sdks-java-core, which is a pretty important package. Perhaps it's getting pulled in transitively?

The Maven archetypes could be a good place to start; compare one with your dependencies to check the differences. https://beam.apache.org/get-started/quickstart-java/ has the info on how to generate a project using the word count archetype.

I would:

  1. Try depending on beam-sdks-java-core explicitly
  2. Either start again using one of the archetypes (starter can be good too) or compare your pom with one generated by those to look for other important differences in your dependencies.
  3. See if GCS or any of the other file systems work - this would indicate that the problem is not with finding the hdfs registrar, but rather that nothing is getting registered at all (and is probably an indicator that you're missing important packages)
  4. I'm not familiar with how spark-submit works, but I note that the Beam quickstart shows a job getting executed with the command line "mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner", rather than using spark-submit. It might be worth checking that out to see if it helps or returns a more useful error.

Hopefully one of those suggestions helps.

Stephen Sisk
  • Thank you for looking into the question. I tried suggestion 1; unfortunately, it did not help. Regarding the rest of the suggestions, I would like to point out that I am already able to successfully execute the job on the same remote machine where the Hadoop ecosystem is, by referring to files available on its underlying file system rather than the Hadoop file system (HDFS), using the same spark-submit command. It only generates the error when I try to access resources available on HDFS, i.e., 'hdfs://host:port/path to the file to read from'. – Chandan Kumar Jun 13 '17 at 09:21
  • And yes, I have verified that the 'beam-sdks-java-core' dependency is being added transitively. – Chandan Kumar Jun 13 '17 at 09:24

You have to modify the following in your WordCount example to make it work:

  • Your WordCountOptions has to extend HadoopFileSystemOptions instead of PipelineOptions.
  • Add the following line between the options creation and the pipeline creation:

    WordCountOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class);
    options.setHdfsConfiguration(Collections.singletonList(new Configuration()));
    Pipeline p = Pipeline.create(options);
    

This is my pom file

<properties>
    <beam.version>2.0.0</beam.version>
    <surefire-plugin.version>2.20</surefire-plugin.version>
</properties>

<repositories>
    <repository>
        <id>apache.snapshots</id>
        <name>Apache Development Snapshot Repository</name>
        <url>https://repository.apache.org/content/repositories/snapshots/</url>
        <releases>
            <enabled>false</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-spark</artifactId>
        <version>${beam.version}</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
        <version>${beam.version}</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>1.6.2</version>
        <scope>runtime</scope>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>jul-to-slf4j</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-flink_2.10</artifactId>
        <version>${beam.version}</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.module</groupId>
        <artifactId>jackson-module-scala_2.10</artifactId>
        <version>2.8.8</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>${beam.version}</version>
    </dependency>

    <!-- Adds a dependency on the Beam Google Cloud Platform IO module. -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
        <version>${beam.version}</version>
    </dependency>

    <!-- Dependencies below this line are specific dependencies needed by the examples code. -->
    <dependency>
        <groupId>com.google.api-client</groupId>
        <artifactId>google-api-client</artifactId>
        <version>1.22.0</version>
        <exclusions>
            <!-- Exclude an old version of guava that is being pulled
                 in by a transitive dependency of google-api-client -->
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava-jdk5</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>com.google.apis</groupId>
        <artifactId>google-api-services-bigquery</artifactId>
        <version>v2-rev295-1.22.0</version>
        <exclusions>
            <!-- Exclude an old version of guava that is being pulled
                 in by a transitive dependency of google-api-client -->
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava-jdk5</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>com.google.http-client</groupId>
        <artifactId>google-http-client</artifactId>
        <version>1.22.0</version>
        <exclusions>
            <!-- Exclude an old version of guava that is being pulled
                 in by a transitive dependency of google-api-client -->
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava-jdk5</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>com.google.apis</groupId>
        <artifactId>google-api-services-pubsub</artifactId>
        <version>v1-rev10-1.22.0</version>
        <exclusions>
            <!-- Exclude an old version of guava that is being pulled
                 in by a transitive dependency of google-api-client -->
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava-jdk5</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>joda-time</groupId>
        <artifactId>joda-time</artifactId>
        <version>2.4</version>
    </dependency>

    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>20.0</version>
    </dependency>

    <!-- Add slf4j API frontend binding with JUL backend -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.14</version>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-jdk14</artifactId>
        <version>1.7.14</version>
        <!-- When loaded at runtime this will wire up slf4j to the JUL backend -->
        <scope>runtime</scope>
    </dependency>

    <!-- Hamcrest and JUnit are required dependencies of PAssert,
         which is used in the main code of DebuggingWordCount example. -->
    <dependency>
        <groupId>org.hamcrest</groupId>
        <artifactId>hamcrest-all</artifactId>
        <version>1.3</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-io-hadoop-common</artifactId>
        <version>${beam.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
        <version>${beam.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-io-hadoop-input-format</artifactId>
        <version>${beam.version}</version>
    </dependency>

    <!-- The DirectRunner is needed for unit tests. -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>${beam.version}</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0-alpha2</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>${surefire-plugin.version}</version>
            <configuration>
                <parallel>all</parallel>
                <threadCount>4</threadCount>
                <redirectTestOutputToFile>true</redirectTestOutputToFile>
            </configuration>
            <dependencies>
                <dependency>
                    <groupId>org.apache.maven.surefire</groupId>
                    <artifactId>surefire-junit47</artifactId>
                    <version>${surefire-plugin.version}</version>
                </dependency>
            </dependencies>
        </plugin>

        <!-- Ensure that the Maven jar plugin runs before the Maven
          shade plugin by listing the plugin higher within the file. -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
        </plugin>



        <!--
          Configures `mvn package` to produce a bundled jar ("fat jar") for runners
          that require this for job submission to a cluster.
        -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.0.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/LICENSE</exclude>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>

    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.4.0</version>
                <configuration>
                    <cleanupDaemonThreads>false</cleanupDaemonThreads>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>
</build>

That solved my issue. I would like to see an answer from someone with more experience with Apache Beam. From a design perspective, considering that Apache Beam's purpose is to add an extra level of abstraction over the engine/source/target/transformation, it is a little frustrating that the framework itself can't infer the source type and do this kind of basic instantiation by itself.

I hope this helps you.

hlagos
  • Hi @lake, could you share your entire pom with me? I tried your suggestion. However, it is giving the following: java.lang.IllegalStateException: Scheme: [file] has conflicting filesystems: [org.apache.beam.sdk.io.LocalFileSystem, org.apache.beam.sdk.io.hdfs.HadoopFileSystem]. And if I exclude the 'org.apache.beam.sdk.io.LocalFileSystem' class through the maven-shade-plugin, it throws 'java.lang.NoClassDefFoundError: org/apache/beam/sdk/io/LocalFileSystem'. Thank you in advance. – Chandan Kumar Jun 14 '17 at 07:21
  • Hi @lake, thank you so much for that. However, to my surprise, even pasting your shared pom.xml in its entirety does not have any effect on the situation discussed above :). – Chandan Kumar Jun 14 '17 at 08:55
  • I don't see where you are passing the HDFS input and output. Try to pass them as parameters using hdfs://host/path – hlagos Jun 14 '17 at 09:03
  • Hi, I am using the annotations @Default.String("hdfs://path/to/InputFile") and @Default.String("hdfs://path/to/outputFile") inside the body of interface WordCountOptions extends HadoopFileSystemOptions to specify the HDFS input/output. – Chandan Kumar Jun 14 '17 at 09:24
  • Even supplying the --inputFile and --output parameters has not had any effect on the error :) – Chandan Kumar Jun 14 '17 at 10:05
  • That is strange; could you please update your post with the spark-submit command and your WordCount class? Also, could you try running your process in non-cluster mode? – hlagos Jun 14 '17 at 12:33
  • Hi @lake, I updated the post. Kindly note that the WordCount code has other classes as dependencies. I did modify the pom.xml adapted from your answer in accordance with the chain of exceptions I got in the process of reruns. In parallel to this, could you please upload your project to GitHub and share the link along with the spark-submit command you are using to execute it? – Chandan Kumar Jun 15 '17 at 07:21
  • Hey! I will do it today, I was out for a few days. – hlagos Jun 19 '17 at 13:08
  • Kindly update here when you have done so. Thank you in advance :) – Chandan Kumar Jun 20 '17 at 10:02