
I'm trying to run an Apache Beam pipeline in a Spring Boot project on Google Dataflow, but I keep getting this error:

Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)

The example I'm trying to run is the basic word count from the official documentation: https://beam.apache.org/get-started/wordcount-example/. The problem is that the documentation uses a separate class for each variant of the example, each with its own main function, while what I tried to do is run the example in a Spring Boot project with a class that implements CommandLineRunner.

Spring Boot main class:

@SpringBootApplication
public class BeamApplication {
    public static void main(String[] args) {
        SpringApplication.run(BeamApplication.class, args);
    }
}

CommandLineRunner:

@Component
public class Runner implements CommandLineRunner {

    @Override
    public void run(String... args) throws Exception {
        WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(WordCountOptions.class);
        runWordCount(options);
    }

    static void runWordCount(WordCountOptions options) throws InterruptedException {
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
                .apply(new CountWords())
                .apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.write().to(options.getOutput()));

        p.run().waitUntilFinish();
    }
}

WordCountOptions:

public interface WordCountOptions extends PipelineOptions {

    @Description("Path of the file to read from")
    @Default.String("./src/main/resources/input.txt")
    String getInputFile();
    void setInputFile(String value);

    @Description("path of output file")
    // @Validation.Required
    // @Default.String("./target/ts_output/extracted_words")
    @Default.String("Path of the file to write to")
    String getOutput();
    void setOutput(String value);
}

ExtractWordsFn:

public class ExtractWordsFn extends DoFn<String, String> {
    public static final String TOKENIZER_PATTERN = "[^\\p{L}]+";

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Split the line on non-letter characters and drop empty tokens.
        for (String word : c.element().split(TOKENIZER_PATTERN)) {
            if (!word.isEmpty()) {
                c.output(word);
            }
        }
    }
}

CountWords:

public class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
        // Convert lines of text into individual words.
        PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

        // Count the number of times each word occurs.
        PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());

        return wordCounts;
    }
}

When I use the direct runner, the project works as expected and generates files in the root directory of the project. But when I try to use the Google Dataflow runner by passing these arguments: --runner=DataflowRunner --project=datalake-ng --stagingLocation=gs://data_transformer/staging/ --output=gs://data_transformer/output (whether with java -jar or from IntelliJ), I get the error mentioned at the beginning of my post.

I'm using Java 11, and after looking at Failed to construct instance from factory method DataflowRunner#fromOptions in beamSql, apache beam, I moved my code into a fresh Java 8 Spring Boot project, but the error remained the same.

When running the project provided by the Apache Beam documentation (separate classes, each with its own main), it works fine on Google Dataflow and I can see the generated output in the Google bucket. My WordCountOptions interface is the same as the one provided by the official documentation.

Could the issue be caused by the CommandLineRunner? I thought the arguments were not being received by the app, but when I debugged this line,

WordCountOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class); 

the variable options has the right values, which are --runner=DataflowRunner --project=target-datalake-ng --stagingLocation=gs://data_transformer/staging/ --output=gs://data_transformer/output.
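For reference, here is a quick sanity check that confirms the same thing without a debugger (a sketch of mine, not part of my original code; it assumes the standard GcpOptions interface from Beam's GCP extensions and goes inside run()):

import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;

// Sanity check: print which runner class and GCP project Beam
// actually resolved from the parsed arguments.
WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(WordCountOptions.class);
System.out.println("Runner:  " + options.getRunner());                       // expect DataflowRunner
System.out.println("Project: " + options.as(GcpOptions.class).getProject()); // expect the --project value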

EDIT:

I found out that the cause of the error is a problem with gcloud authentication and access to the Google Cloud bucket (Anonymous caller does not have storage.buckets.list access to project 961543751). I double-checked the access and it should be set correctly, since it works fine with the Beam example's default project. I revoked all access and set it up again, but the issue remains. I took a look at https://github.com/googleapis/google-cloud-node/issues/2456 and https://github.com/googleapis/google-cloud-ruby/issues/1588, and I'm still trying to identify the issue, but for now it seems like a dependency version problem.
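One way to rule out the GOOGLE_APPLICATION_CREDENTIALS lookup entirely would be to load the service-account key myself and hand it to Beam before creating the pipeline (a minimal sketch, assuming a key file at the hypothetical path /path/to/key.json; this goes inside run(), which already declares throws Exception):

import java.io.FileInputStream;
import com.google.auth.oauth2.GoogleCredentials;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;

// Load the service-account key explicitly instead of relying on the
// GOOGLE_APPLICATION_CREDENTIALS environment-variable lookup.
GoogleCredentials credentials = GoogleCredentials
        .fromStream(new FileInputStream("/path/to/key.json")) // hypothetical path
        .createScoped("https://www.googleapis.com/auth/cloud-platform");
options.as(GcpOptions.class).setGcpCredential(credentials);

If this makes the "Anonymous caller" error go away, the key lookup was the problem rather than the key's permissions.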

  • Yeah, based on that error it seems like an authentication issue. For Dataflow to work, the GOOGLE_APPLICATION_CREDENTIALS environment variable has to be set, as mentioned here: https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven. Is it possible that you are starting up a shell where this is not set? – chamikara Aug 06 '19 at 20:13
  • Yes, I already followed the steps in that link, and the `GOOGLE_APPLICATION_CREDENTIALS` JSON file should be set correctly (I tried generating it twice). I'm thinking it's more of a dependency issue. My question is: could a Maven dependency other than `beam-runners-direct-java` or `beam-runners-google-cloud-dataflow-java`, not used anywhere in the code, still be needed for the project to run correctly? Because in the Beam example project (generated in the "Get the WordCount code" section of the link you posted) there are a lot of other dependencies and I'm not sure if they are needed. – med.b Aug 07 '19 at 08:24
  • Yes, all jars that Beam depends on are also needed. In normal execution, all Beam jars and the jars Beam depends on get staged in GCS for pipeline execution. – chamikara Aug 07 '19 at 16:26
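Following up on chamikara's point about dependencies, this is the minimal dependency set I'd expect for this pipeline when comparing pom.xml files (a sketch; ${beam.version} is a placeholder for whatever version the project already uses):

<!-- Core SDK plus both runners. The Dataflow runner pulls in the GCP
     extensions (GCS filesystem, GcpOptions, credential handling)
     transitively, so they don't need to be declared explicitly. -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>${beam.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
    <version>${beam.version}</version>
</dependency>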
