
I'm running into an issue using ElasticsearchIO.read() to run more than one query. My queries are built dynamically as a PCollection from an incoming group of values, and I'm trying to find a way to supply each of them to the .withQuery() parameter, or any other approach that provides this flexibility.
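For context, each query is a JSON string built per incoming value, roughly along these lines (a minimal sketch; the field name "cohortId" and the helper class are placeholders, not from the real pipeline):

```java
// Hypothetical sketch of how the per-value queries are built.
public class QueryBuilder {
    // Renders an Elasticsearch term query as a JSON string for one value.
    public static String termQuery(String field, String value) {
        return String.format("{\"query\": {\"term\": {\"%s\": \"%s\"}}}", field, value);
    }
}
```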

The issue is that the ElasticsearchIO.read() method expects a PBegin input to start a pipeline, but it seems like I need to invoke it outside of a pipeline context somehow. PBegin represents the beginning of a pipeline, and it's required to create a pipeline that can read data from Elasticsearch using ElasticsearchIO.read().

Can I wrap the ElasticsearchIO.read() call in a Create transform that creates a PCollection with a single element (e.g., PBegin) to simulate the beginning of a pipeline or something similar?

Here is my naive attempt, which ignores the reality of PBegin:

    PCollection<String> queries = ...; // a PCollection of Elasticsearch queries

    PCollection<String> queryResults = queries
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String query = c.element();
                // Does not compile: ProcessContext has no pipeline()
                // accessor, and transforms cannot be applied from inside
                // a DoFn at execution time.
                PCollection<String> results = c.pipeline()
                    .apply(ElasticsearchIO.read()
                        .withConnectionConfiguration(
                            ElasticsearchIO.ConnectionConfiguration.create(hosts, indexName))
                        .withQuery(query));
                // Does not compile either: c.output() expects a String,
                // not a PCollection<String>.
                c.output(results);
            }
        }))
        // Flatten.pCollections() applies to a PCollectionList, not to
        // the PCollection<String> produced above.
        .apply(Flatten.pCollections());

More generally, for any of the IO classes provided by Beam that require a PBegin input, I'm wondering whether there is a way to feed in a collection. Here is one approach that might be promising:

// Define a ValueProvider for a List<String>
ValueProvider<List<String>> myListProvider =
    ValueProvider.StaticValueProvider.of(myList);

// Use the ValueProvider to create a PCollection. Note that ListCoder
// requires an element coder, and Create.ofProvider() yields a single
// List<String> element rather than one element per String.
PCollection<List<String>> pcoll =
    pipeline.apply(Create.ofProvider(myListProvider, ListCoder.of(StringUtf8Coder.of())));
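For reference, myList here would just be an in-memory list of query strings; the values below are placeholders standing in for the dynamically built queries:

```java
import java.util.Arrays;
import java.util.List;

// Placeholder query strings standing in for the dynamically built ones.
List<String> myList = Arrays.asList(
    "{\"query\": {\"match_all\": {}}}",
    "{\"query\": {\"term\": {\"status\": \"active\"}}}");
```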
  • Please can you share a code snippet ? – Mazlum Tosun Apr 12 '23 at 17:25
  • Sorry please can you move the code snippet in your question instead of a comment, for more readability ? – Mazlum Tosun Apr 12 '23 at 18:02
  • Should I include something else? I'm a newbie so I'm not sure if this is acceptable or if more information is needed. – user21627820 Apr 13 '23 at 19:06
  • No it's perfect thanks. Have you many queries to launch ? – Mazlum Tosun Apr 14 '23 at 08:24
  • It depends on the size of the cohort I'm processing. Rather than being a batch process, this is actually part of a streaming process that reads from a PubSubIO subscription. Based on a predetermined batch size and time window, the collection of SearchHits is provided. – user21627820 Apr 14 '23 at 12:50
  • I was looking at the "Related questions" below, specifically, https://stackoverflow.com/questions/45419994/apply-side-input-to-bigqueryio-read-operation-in-apache-beam?rq=2 It is about 5 years old and mentions the Splittable DoFns becoming available in the future. I'm wondering if this approach has potential. – user21627820 Apr 14 '23 at 13:21
  • I'll guess this question isn't suitable for this forum. I'll try elsewhere. Thanks – user21627820 Apr 21 '23 at 20:07

0 Answers