I'm very new to Apache Beam and my Java skills are quite weak, but I'd like to understand why my simple record manipulation runs so slowly with Apache Beam.
What I'm trying to do is the following: I have a CSV file with 1 million records (the Alexa top 1 million sites) in the format NUMBER,DOMAIN (e.g. 1,google.com), and I want to “strip” the first (number) field and keep only the domain part. My code for this pipeline is the following:
package misc.examples;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class Example {
    static class ExtractDomainsFn extends DoFn<String, String> {
        private final Counter domains = Metrics.counter(ExtractDomainsFn.class, "domains");

        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().contains(",")) {
                domains.inc();
                String domain = c.element().split(",")[1];
                c.output(domain);
            }
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        p.apply("ReadLines", TextIO.read().from("./top-1m.csv"))
         .apply("ExtractDomains", ParDo.of(new ExtractDomainsFn()))
         .apply("WriteDomains", TextIO.write().to("domains"));

        p.run().waitUntilFinish();
    }
}
When I execute this code with Maven, it takes more than four minutes to complete on my laptop:
$ mvn compile exec:java -Dexec.mainClass=misc.examples.Example
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building my-example 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ my-example ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /…/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ my-example ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ my-example ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:36 min
[INFO] Finished at: 2017-06-24T15:20:33+03:00
[INFO] Final Memory: 31M/1685M
[INFO] ------------------------------------------------------------------------
While the simple cut(1) works before you can even blink:
$ time cut -d, -f2 top-1m.csv > domains
real 0m0.171s
user 0m0.140s
sys 0m0.028s
So, is such Apache Beam behavior considered acceptable (perhaps it would compare better on larger amounts of data), or is my code simply inefficient?
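(For what it's worth, I believe the same extraction could also be expressed with Beam's built-in Filter and MapElements transforms instead of my hand-written DoFn, dropping only the counter; I wouldn't expect it to change the timings, but here is a rough sketch in case the DoFn itself is the problem:)

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

// Rough sketch, not the code I actually measured: the same extraction using
// built-in transforms; the metrics counter from my DoFn is omitted here.
Pipeline p = Pipeline.create();
p.apply("ReadLines", TextIO.read().from("./top-1m.csv"))
 .apply("KeepCsvLines", Filter.by((String line) -> line.contains(",")))
 .apply("ExtractDomains",
        MapElements.into(TypeDescriptors.strings())
                   .via((String line) -> line.split(",")[1]))
 .apply("WriteDomains", TextIO.write().to("domains"));
p.run().waitUntilFinish();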
01-07-2017 Update:
As Kenn Knowles suggested, I've tried running the pipeline on a runner other than the DirectRunner, namely the DataflowRunner. The updated code looks like the following:
package misc.examples;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class Example {
    static class ExtractDomainsFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().contains(",")) {
                String domain = c.element().split(",")[1];
                c.output(domain);
            }
        }
    }

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        DataflowPipelineOptions dataflowOptions = options.as(DataflowPipelineOptions.class);
        dataflowOptions.setRunner(DataflowRunner.class);
        dataflowOptions.setProject("my-gcp-project-id");

        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("gs://my-gcs-bucket/top-1m.csv"))
         .apply("ExtractDomains", ParDo.of(new ExtractDomainsFn()))
         .apply("WriteDomains", TextIO.write().to("gs://my-gcs-bucket/output/"));

        p.run().waitUntilFinish();
    }
}
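(Side note: I believe the hard-coded runner and project could just as well be passed as command-line flags and parsed with PipelineOptionsFactory.fromArgs; a minimal sketch of main with that approach, assuming the same ExtractDomainsFn and bucket paths:)

    // Minimal sketch, not the code I actually ran: the runner and project are
    // read from flags such as --runner=DataflowRunner --project=my-gcp-project-id
    // (e.g. supplied via -Dexec.args="...") instead of being hard-coded.
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);

        Pipeline p = Pipeline.create(options);
        p.apply("ReadLines", TextIO.read().from("gs://my-gcs-bucket/top-1m.csv"))
         .apply("ExtractDomains", ParDo.of(new ExtractDomainsFn()))
         .apply("WriteDomains", TextIO.write().to("gs://my-gcs-bucket/output/"));
        p.run().waitUntilFinish();
    }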
The elapsed time running on Google Dataflow is shorter than with the DirectRunner, but still quite slow: a bit more than 3 minutes: