I am new to Apache Beam and am trying to run a sample read-and-write program using both DirectRunner and DataflowRunner. My use case needs a few CLI args, so I created an interface, "CustomOptions.java", which extends PipelineOptions.
With DirectRunner the program runs fine, but with DataflowRunner it fails with "interface CustomOptions missing a property named 'project'".
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.2.0</version>
        <type>maven-plugin</type>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>
</dependencies>
CustomOptions.java (Interface)
import org.apache.beam.sdk.options.PipelineOptions;

public interface CustomOptions extends PipelineOptions {
    String getInput();
    void setInput(String value);
    String getOutput();
    void setOutput(String value);
}
WordCount.java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WordCount {
    public static void main(String args[]) {
        PipelineOptionsFactory.register(CustomOptions.class);
        CustomOptions options = PipelineOptionsFactory.fromArgs(args).as(CustomOptions.class);
        Pipeline p = Pipeline.create(options);
        p.apply("Read", TextIO.read().from(options.getInput()))
         .apply("Write", TextIO.write().to(options.getOutput()));
        p.run();
    }
}
Commands:
DirectRunner (Working) : java -cp jarPath WordCount --input=inputPath --output=outputPath
DataflowRunner (Not Working) : java -cp jarPath WordCount --input=inputPath --output=outputPath --runner=DataflowRunner --stagingLocation=gs://<tmp_path> --project=<projectId>
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Class interface CustomOptions missing a property named 'project'.
    at org.apache.beam.sdk.options.PipelineOptionsFactory.parseObjects(PipelineOptionsFactory.java:1625)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.access$400(PipelineOptionsFactory.java:115)
    at org.apache.beam.sdk.options.PipelineOptionsFactory$Builder.as(PipelineOptionsFactory.java:298)
    at WordCount.main(WordCount.java:13)
The second thing I tried was to make CustomOptions extend DataflowPipelineOptions instead of PipelineOptions. With that change I get a different error:
Exception in thread "main" java.lang.IllegalArgumentException: No filesystem found for scheme gs
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:463)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:533)
    at org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:215)
    at org.apache.beam.sdk.io.TextIO$TypedWrite.to(TextIO.java:734)
    at org.apache.beam.sdk.io.TextIO$Write.to(TextIO.java:1069)
    at WordCount.main(WordCount.java:15)
The second attempt also raises another concern: the same code could no longer be run with both DirectRunner and DataflowRunner, because in that case "projectId" is a mandatory argument, and it would not be specified when using DirectRunner.