2

I am trying to enhance data in a pipeline by querying Datastore in a DoFn step. A field of an object of class CustomClass is used to query a Datastore kind, and the returned values are used to enhance the object.

The code looks like this:

public class EnhanceWithDataStore extends DoFn<CustomClass, CustomClass> {

  private static Datastore datastore = DatastoreOptions.defaultInstance().service();
  private static KeyFactory articleKeyFactory = datastore.newKeyFactory().kind("article");

  @Override
  public void processElement(ProcessContext c) throws Exception {

    CustomClass event = c.element();

    // One synchronous Datastore lookup per element
    Entity article = datastore.get(articleKeyFactory.newKey(event.getArticleId()));

    String articleName = "";
    try {
      articleName = article.getString("articleName");
    } catch (Exception e) {
      // fall back to the empty string if the article or property is missing
    }

    CustomClass enhanced = new CustomClass(event);
    enhanced.setArticleName(articleName);

    c.output(enhanced);
  }
}

When it is run locally, this is fast, but when it is run in the cloud, this step slows down the pipeline significantly. What's causing this? Is there any workaround or better way to do this?

A picture of the pipeline can be found here (the last step is the enhancing step): pipeline architecture

Matthias Baetens
  • If you are willing to share a job id, we can take a look directly. – Kenn Knowles Oct 14 '16 at 20:33
  • Hi Kenn, The Job ID is: 2016-10-14_10_34_39-5525093815482139851. Thanks for looking into this. The code does look fine (is this the best practice to query Datastore from Dataflow)? – Matthias Baetens Oct 15 '16 at 20:28
  • After taking another look, I have provided what I think is the best answer to start from, before trying to debug at a lower level. – Kenn Knowles Oct 16 '16 at 23:52

2 Answers

5

What you are doing here is a join between your input PCollection<CustomClass> and the enhancements in Datastore.

For each partition of your PCollection, the calls to Datastore are going to be single-threaded and hence incur a lot of latency. I would expect this to be slow in the DirectPipelineRunner and InProcessPipelineRunner as well. With autoscaling and dynamic work rebalancing, you should see parallelism when running on the Dataflow service, unless something about the structure of your pipeline causes us to optimize it poorly, so you can try increasing --maxNumWorkers. But you still won't benefit from bulk operations.
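If you go that route, here is a minimal sketch of raising the worker cap programmatically, assuming the Dataflow Java SDK 1.x options API (the helper class name and the value 15 are only illustrative; --maxNumWorkers can equally be passed on the command line):

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class PipelineOptionsSetup {
  public static DataflowPipelineOptions fromArgs(String[] args) {
    // Parse --project, --stagingLocation, etc. from the command line
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // Allow autoscaling up to 15 workers (same effect as --maxNumWorkers=15)
    options.setMaxNumWorkers(15);
    return options;
  }
}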

It is probably better to express this join within your pipeline, using DatastoreIO.readFrom(...) followed by a CoGroupByKey transform. In this way, Dataflow will do a bulk parallel read of all the enhancements and use the efficient GroupByKey machinery to line them up with the events.

// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);

// Key them both by the common id
PCollection<KV<Long, CustomClass>> keyedEvents =
    events.apply(WithKeys.of(event -> event.getArticleId()));

PCollection<KV<Long, Entity>> keyedArticles =
    articles.apply(WithKeys.of(article -> article.getKey().getId()));

// Set up the join by giving tags to each collection
final TupleTag<CustomClass> eventTag = new TupleTag<CustomClass>() {};
final TupleTag<Entity> articleTag = new TupleTag<Entity>() {};
KeyedPCollectionTuple<Long> coGbkInput =
    KeyedPCollectionTuple
        .of(eventTag, keyedEvents)
        .and(articleTag, keyedArticles);

// Group by key, then emit one enhanced copy of each event
PCollection<CustomClass> enhancedEvents = coGbkInput
    .apply(CoGroupByKey.<Long>create())
    .apply(ParDo.of(new DoFn<KV<Long, CoGbkResult>, CustomClass>() {
      @Override
      public void processElement(ProcessContext c) {
        CoGbkResult joinResult = c.element().getValue();
        String articleName;
        try {
          articleName = joinResult.getOnly(articleTag).getString("articleName");
        } catch (Exception e) {
          articleName = "";
        }
        for (CustomClass event : joinResult.getAll(eventTag)) {
          CustomClass enhanced = new CustomClass(event);
          enhanced.setArticleName(articleName);
          c.output(enhanced);
        }
      }
    }));

Another possibility, if there are few enough articles to hold the lookup in memory, is to use DatastoreIO.readFrom(...) and then read them all as a map side input via View.asMap() and look them up in a local table.

// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);

// Key the articles and create a map view
final PCollectionView<Map<Long, Entity>> articleView = articles
    .apply(WithKeys.of(article -> article.getKey().getId()))
    .apply(View.<Long, Entity>asMap());

// Do a lookup join by side input to a ParDo
PCollection<CustomClass> enhancedEvents = events
    .apply(ParDo.withSideInputs(articleView).of(new DoFn<CustomClass, CustomClass>() {
      @Override
      public void processElement(ProcessContext c) {
        CustomClass event = c.element();
        Map<Long, Entity> articleLookup = c.sideInput(articleView);
        String articleName;
        try {
          articleName =
              articleLookup.get(event.getArticleId()).getString("articleName");
        } catch (Exception e) {
          articleName = "";
        }
        CustomClass enhanced = new CustomClass(event);
        enhanced.setArticleName(articleName);
        c.output(enhanced);
      }
    }));

Depending on your data, either of these may be a better choice.

Kenn Knowles
  • Hey Kenn, Thanks for your answer. I'll give it a try soon. The only problem I see with this solution is that if the Datastore gets updated (we might write to it while running the pipeline), these changes won't be available in the pipeline? Is there a way around this (without re-running the pipeline, let's say we run it in streaming mode)? – Matthias Baetens Oct 17 '16 at 10:24
  • You are correct - if it gets updated while the pipeline is running, you will get a particular snapshot. You cannot currently address this with a streaming pipeline, because there is no Datastore API for reading changes. – Kenn Knowles Oct 17 '16 at 18:00
  • Ok thanks for the update. Is there another solution where you could store larger metadata datasets (that do not fit in-memory) and which are dynamically updated while the Pipeline is running? (not necessary Datastore) – Matthias Baetens Oct 17 '16 at 22:53
  • We managed to pinpoint the issue: the project is located in the EU, so the Datastore is in the same location by default, while Dataflow jobs are hosted in the US by default (I did not override this option). Sorry to bother you with this (probably trivial) issue - guess you have to learn this the hard way. FYI: it performs 25-30 times faster when co-located: ~40 elements/s before vs ~1200 elements/s after, for 15 workers. – Matthias Baetens Oct 18 '16 at 23:23
  • I suggest writing your comment as a supplementary answer - I think folks will get value out of it. – Kenn Knowles Oct 19 '16 at 02:16
  • Done! Thanks for your follow-up :) – Matthias Baetens Oct 19 '16 at 08:52
4

After some checking, I managed to pinpoint the problem: the project is located in the EU (and as such, the Datastore is in the EU zone, the same as the App Engine zone), while the Dataflow jobs themselves (and thus the workers) are hosted in the US by default (when the zone option is not overridden).

The difference in performance is 25-30 fold: ~40 elements/s with the workers in the US versus ~1200 elements/s once co-located, for 15 workers.
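For anyone hitting the same issue, here is a minimal sketch of pinning the workers to an EU zone via the Dataflow SDK 1.x options (the zone europe-west1-b is only an example; choose one matching your Datastore / App Engine location, or pass --zone on the command line):

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class EuPipelineOptionsSetup {
  public static DataflowPipelineOptions fromArgs(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // Keep the workers in the same region as the Datastore / App Engine app
    // (same effect as passing --zone=europe-west1-b)
    options.setZone("europe-west1-b");
    return options;
  }
}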

Matthias Baetens