I am new to Apache Beam and am trying to run a sample read-and-write program with both DirectRunner and DataflowRunner. My use case needs a few CLI arguments, so I created an interface, "CustomOptions.java", which extends PipelineOptions.

With DirectRunner the program runs fine, but with DataflowRunner it fails with "Class interface CustomOptions missing a property named 'project'".

pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.2.0</version>
        <type>maven-plugin</type>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>

</dependencies>

CustomOptions.java (Interface)

import org.apache.beam.sdk.options.PipelineOptions;

public interface CustomOptions extends PipelineOptions {

    String getInput();
    void setInput(String value);

    String getOutput();
    void setOutput(String value);
}

WordCount.java

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WordCount {

    public static void main(String args[]) {
        PipelineOptionsFactory.register(CustomOptions.class);
        CustomOptions options = PipelineOptionsFactory.fromArgs(args).as(CustomOptions.class);
        Pipeline p = Pipeline.create(options);

        p.apply("Read", TextIO.read().from(options.getInput()))
                .apply("Write", TextIO.write().to(options.getOutput()));

        p.run();
    }
}

Commands:

DirectRunner (Working) : java -cp jarPath WordCount --input=inputPath --output=outputPath
DataflowRunner (Not Working) : java -cp jarPath WordCount --input=inputPath --output=outputPath --runner=DataflowRunner --stagingLocation=gs://<tmp_path> --project=<projectId>

Error:

Exception in thread "main" java.lang.IllegalArgumentException: Class interface CustomOptions missing a property named 'project'.
    at org.apache.beam.sdk.options.PipelineOptionsFactory.parseObjects(PipelineOptionsFactory.java:1625)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.access$400(PipelineOptionsFactory.java:115)
    at org.apache.beam.sdk.options.PipelineOptionsFactory$Builder.as(PipelineOptionsFactory.java:298)
    at WordCount.main(WordCount.java:13)

The second thing I tried was to make CustomOptions extend DataflowPipelineOptions instead of PipelineOptions. With that change I get a different error:

Exception in thread "main" java.lang.IllegalArgumentException: No filesystem found for scheme gs
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:463)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:533)
    at org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:215)
    at org.apache.beam.sdk.io.TextIO$TypedWrite.to(TextIO.java:734)
    at org.apache.beam.sdk.io.TextIO$Write.to(TextIO.java:1069)
    at WordCount.main(WordCount.java:15)

The second attempt also raises another question: the same code can then no longer be run with both DirectRunner and DataflowRunner, because in the second case "projectId" is a mandatory argument, and it will not be specified when using DirectRunner.

Jitesh Sharma
  • In the first case, can you remove the --project= argument? – Jayadeep Jayaraman Nov 26 '19 at 14:42
  • Just to clarify, were you extending DataflowPipelineOptions and running on the DataflowRunner when you saw the "No filesystem found for scheme gs" error? I would not expect that error to occur if you are extending DataflowPipelineOptions. Would you mind clarifying (1) which of the two command lines you used, and (2) which options class you were extending when you saw that error? – Alex Amato Nov 26 '19 at 22:25
  • I'm not 100% sure whether you can use DataflowPipelineOptions with DirectRunner. If it requires you to pass parameters like --project to DirectRunner, it might work if you pass an unused placeholder value. I think, though, that the --project param is used by the sources and sinks if they are reading/writing data to a GCP service, in which case you will need to specify a valid value. If that fails, you could have two main programs that swap out the options class, one for DataflowRunner and one for DirectRunner. – Alex Amato Nov 26 '19 at 22:29
  • @JayadeepJayaraman If I remove --project=, it throws another exception for the key --stagingLocation=. It states that CustomOptions does not have a property "stagingLocation". – Jitesh Sharma Nov 28 '19 at 04:12
  • @AlexAmato Yes, you got it right: the "No filesystem found for scheme gs" error occurs when I am extending DataflowPipelineOptions. And I am using both commands, one with DirectRunner and the other with DataflowRunner. – Jitesh Sharma Nov 28 '19 at 04:21

1 Answer

After some trial and error, I think I found the fix. I am using the same Java classes as in the question, i.e. CustomOptions extends PipelineOptions. The only change I made was in pom.xml.

I now use the Maven Shade plugin with some extra configuration instead of the Maven Assembly plugin. This gives me two things:
1. The same jar can be used with DirectRunner or DataflowRunner.
2. I can state on the command line which main class I want to execute.

Previous 'pom.xml':

<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id> <!-- this is used for inheritance merges -->
                    <phase>package</phase> <!-- bind to the packaging phase -->
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <!-- add Main-Class to manifest file -->
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.dh.WordCount</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>

    </plugins>
</build>

<dependencies>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.2.0</version>
        <type>maven-plugin</type>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>

</dependencies>

New 'pom.xml':

<build>
    <plugins>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>

    </plugins>
</build>

<dependencies>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>

</dependencies>

This was made possible when I read this answer: Google Dataflow "No filesystem found for scheme gs"
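
A likely reason the Shade configuration matters: Beam appears to discover file systems (such as the one for gs://) and runner-specific options (such as --project and --stagingLocation) through java.util.ServiceLoader entries under META-INF/services. A fat jar that keeps only one copy of each such file can drop entries, whereas ServicesResourceTransformer merges them. The sketch below is a stdlib-only diagnostic and not part of the original fix. The class name ServiceFileCheck is hypothetical, and the two service names are an assumption about which Beam registrar interfaces are involved. It prints every provider listed for those services on the current classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists ServiceLoader providers visible on the classpath.
public class ServiceFileCheck {

    // Collects the provider class names from every copy of
    // META-INF/services/<serviceName> that the system class loader can see.
    public static List<String> listProviders(String serviceName) {
        List<String> providers = new ArrayList<>();
        String resource = "META-INF/services/" + serviceName;
        try {
            ClassLoader loader = ClassLoader.getSystemClassLoader();
            for (URL url : Collections.list(loader.getResources(resource))) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Service files may contain '#' comments and blank lines.
                        int hash = line.indexOf('#');
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            providers.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return providers;
    }

    public static void main(String[] args) {
        // Assumed service interface names; adjust if your Beam version differs.
        String[] services = {
            "org.apache.beam.sdk.io.FileSystemRegistrar",
            "org.apache.beam.sdk.options.PipelineOptionsRegistrar"
        };
        for (String service : services) {
            System.out.println(service + " -> " + listProviders(service));
        }
    }
}
```

If this class is compiled into the project and run with the built jar on the classpath, a list that lacks a GCS file system registrar or a Dataflow options registrar would suggest the jar was built without merging the service files.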

Jitesh Sharma