
I'm very new to Apache Beam and my Java skills are quite low, but I'd like to understand why my simple entry manipulations work so slowly with Apache Beam.

What I'm trying to perform is the following: I have a CSV file with 1 million records (the Alexa top 1 million sites) in the scheme NUMBER,DOMAIN (e.g. 1,google.com), and I want to “strip” the first (number) field and keep only the domain part. My code for this pipeline is the following:

package misc.examples;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class Example {

  static class ExtractDomainsFn extends DoFn<String, String> {
    private final Counter domains = Metrics.counter(ExtractDomainsFn.class, "domains");

    @ProcessElement
    public void processElement(ProcessContext c) {
      if (c.element().contains(",")) {
        domains.inc();

        String domain = c.element().split(",")[1];
        c.output(domain);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadLines", TextIO.read().from("./top-1m.csv"))
     .apply("ExtractDomains", ParDo.of(new ExtractDomainsFn()))
     .apply("WriteDomains", TextIO.write().to("domains"));

    p.run().waitUntilFinish();
  }
}
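One thing worth noting about the DoFn itself: String.split compiles its regex argument on every call, once per line. A hypothetical micro-optimization (not from the original post) is to use indexOf/substring instead; a minimal, standalone sketch of that extraction logic:

```java
public class DomainExtract {

    // Extracts the part after the first comma without String.split,
    // which would compile a regex for every input line.
    // Returns null when the line has no comma (caller skips such lines).
    static String extractDomain(String line) {
        int comma = line.indexOf(',');
        return comma >= 0 ? line.substring(comma + 1) : null;
    }

    public static void main(String[] args) {
        System.out.println(DomainExtract.extractDomain("1,google.com")); // google.com
        System.out.println(DomainExtract.extractDomain("no-comma-here")); // null
    }
}
```

This is unlikely to explain a four-minute runtime on its own, but it is the same per-element work `cut` does, without regex overhead.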

When I execute this code with Maven it takes more than four minutes to succeed on my laptop:

$ mvn compile exec:java -Dexec.mainClass=misc.examples.Example
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building my-example 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ my-example ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /…/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ my-example ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ my-example ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:36 min
[INFO] Finished at: 2017-06-24T15:20:33+03:00
[INFO] Final Memory: 31M/1685M
[INFO] ------------------------------------------------------------------------

While the simple cut(1) works before you can even blink:

$ time cut -d, -f2 top-1m.csv > domains

real    0m0.171s
user    0m0.140s
sys     0m0.028s

So, is such Apache Beam behavior considered acceptable (perhaps it would perform comparatively better on larger amounts of data), or is my code just inefficient?

2017-07-01 Update:

As Kenn Knowles suggested, I've tried to run the pipeline on a runner other than the DirectRunner — the DataflowRunner. So the updated code looks like the following:

package misc.examples;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class Example {

  static class ExtractDomainsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      if (c.element().contains(",")) {
        String domain = c.element().split(",")[1];
        c.output(domain);
      }
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    DataflowPipelineOptions dataflowOptions = options.as(DataflowPipelineOptions.class);
    dataflowOptions.setRunner(DataflowRunner.class);
    dataflowOptions.setProject("my-gcp-project-id");
    Pipeline p = Pipeline.create(options);
    p.apply("ReadLines", TextIO.read().from("gs://my-gcs-bucket/top-1m.csv"))
     .apply("ExtractDomains", ParDo.of(new ExtractDomainsFn()))
     .apply("WriteDomains", TextIO.write().to("gs://my-gcs-bucket/output/"));

    p.run().waitUntilFinish();
  }
}

Elapsed time running on Google Dataflow is smaller compared to the Direct runner but still slow enough — a bit more than 3 minutes:

Google Dataflow Job

Google Dataflow Job Logs

Petr Razumov
  • From what I understood, Apache Beam loads the whole text file in-memory into a `PCollection` and then continues operating on it. On the other hand, Linux streams are very efficient and don't load the whole file at once, which indeed isn't necessary here – UninformedUser Jun 24 '17 at 15:10
  • And why do you think that it would work comparably better on larger data? Surely not as long as you have just this simple task and use the direct runner. The idea of Apache Beam is to define a generic pipeline which can be run on different frameworks. And of course the *direct runner* is simply the Java in-memory way — using a runner for Spark or Flink is indeed more efficient as those are frameworks for BigData processing. Cheers. – UninformedUser Jun 25 '17 at 05:43
  • Updated my answer according to your updated question. – Kenn Knowles Jul 03 '17 at 03:59
  • What version of the SDK are you using? We have isolated a major slowdown in the direct runner at HEAD, but version 2.1.0 is much faster. If you are experiencing the slowdown with 2.1.0 we would like to know about that. – Kenn Knowles Sep 12 '17 at 22:31
  • Nevermind, I see 2.0.0 in your screenshot. In this case, we actually don't have a handle on the slowdown. If you would care to add details on the linked bug, we definitely want to get to the bottom of this. – Kenn Knowles Sep 13 '17 at 02:26

1 Answer


Apache Beam provides correct event time processing and portability over massive-scale data processing engines such as Apache Flink, Apache Spark, Apache Apex, and Google Cloud Dataflow.

Here, it would seem you are running your pipeline with the default DirectRunner, which is a way to test the correctness of a pipeline at small scale (where "small" means anything not using multiple machines). To test correctness, the runner also performs extra work, such as checking your serialization (Coder) and putting elements in random order to make sure your pipeline isn't order-dependent.

The DirectRunner does not necessarily bring all the values into memory at once; rather, it has a streaming model of execution, so it also works with unbounded datasets and triggering. This also adds overhead compared to a simple loop.
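For benchmarking purposes (not tests, since these checks exist to catch real pipeline bugs), the DirectRunner's extra enforcements can be switched off via its options interface. A sketch, assuming the DirectOptions interface from the beam-runners-direct-java artifact:

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DirectBench {
  public static void main(String[] args) {
    DirectOptions options = PipelineOptionsFactory.as(DirectOptions.class);
    // Skip round-tripping every element through its Coder.
    options.setEnforceEncodability(false);
    // Skip checks that DoFns don't mutate their inputs.
    options.setEnforceImmutability(false);
    Pipeline p = Pipeline.create(options);
    // ... apply transforms as in the question, then p.run().waitUntilFinish();
  }
}
```

This isolates how much of the runtime is correctness checking versus actual processing.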

That said, four minutes is pretty slow and I filed BEAM-2516 to follow up.

You can also try running it on other backends, and in particular the SparkRunner, FlinkRunner, and ApexRunner support embedded execution on your laptop.
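Switching runners is a configuration change, not a code rewrite. A sketch of running the same pipeline embedded on Flink, assuming the beam-runners-flink artifact is on the classpath:

```java
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlinkLocal {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    // Embedded (local) Flink execution on the laptop; no cluster needed.
    options.setRunner(FlinkRunner.class);
    Pipeline p = Pipeline.create(options);
    // ... same ReadLines / ExtractDomains / WriteDomains transforms as above
    p.run().waitUntilFinish();
  }
}
```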

Response to 2017-07-01 Update:

Though the total running time you experience on Cloud Dataflow is ~3 minutes, the actual time taken to process the data is ~1 minute. You can see this in the logs. The rest is spinning up and shutting down worker VMs. We are constantly working to reduce this overhead. Why does it take ~1 minute? You'd have to profile to find out (and I'd love to hear the results!), but certainly Dataflow is doing a lot more than cut: reading from and writing to GCS, providing durability and fault tolerance, and, in the TextIO write step, performing a networked shuffle of your data in order to write sharded files in parallel. There are obviously things that could be optimized away if Dataflow noticed that your calculation has no parallelism and is small enough that it doesn't need it.
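If the shuffle behind the sharded write is the concern, TextIO lets you pin the shard count; a sketch (this trades away parallelism in the write step):

```java
// Forcing a single output shard removes the need to redistribute data
// across workers before writing, at the cost of writing serially.
p.apply("ReadLines", TextIO.read().from("gs://my-gcs-bucket/top-1m.csv"))
 .apply("ExtractDomains", ParDo.of(new ExtractDomainsFn()))
 .apply("WriteDomains",
     TextIO.write().to("gs://my-gcs-bucket/output/").withNumShards(1));
```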

But do remember that Beam and Cloud Dataflow exist to help you use parallel processing on volumes of data that cannot be processed in a timely manner on a single machine. So processing tiny examples with no parallelism available is not a goal.

Minor sequential calculations often do occur as small parts of a large pipeline, but in the context of a realistic physical plan a small auxiliary calculation will often not impact end-to-end time. The overheads of VM management are also a one-time cost, so they are more likely to be measured against many minutes to hours of computation on dozens to hundreds of machines.

Kenn Knowles
  • Thanks, @KennKnowles! I've just updated my question with details on running the job on Google Dataflow. It's a bit faster but still more than 3 minutes. – Petr Razumov Jul 01 '17 at 14:12