My team uses Geode as a makeshift analytics engine. We store a collection of massive raw data objects (200MB+ each) in Geode, but these objects are never directly returned to the client. Instead, we rely heavily on custom function execution to process these data sets inside Geode, and only return the analysis result set.

We have a new requirement to implement two tiers of data analytics precision. The high-precision analytics will require larger raw data sets and more CPU time. It is imperative that these high-precision analyses do not inhibit the low-precision analytics performance in any way. As such, I'm looking for a solution that keeps these data sets isolated to different servers.

I built a POC that keeps each data set in its own region (both are PARTITIONED). These regions are configured to belong to separate Member Groups, then each server is configured to join one of the two groups. I'm able to stand up this cluster locally without issue, and gfsh indicates that everything looks correct: describe member shows each member hosting the expected regions.

My client code configures a ClientCache that points at the cluster's single locator. My function execution command generally looks like the following:

FunctionService
  .onRegion(highPrecisionRegion)
  .setArguments(inputObject)
  .withFilter(keySet)
  .execute(function);

When I only run the high-precision server, I'm able to execute the function against the high-precision region. When I only run the low-precision server, I'm able to execute the function against the low-precision region. However, when I run both servers and execute the functions one after the other, I invariably get an exception stating that one of the regions cannot be found. See the following Gist for a sample of my code and the exception. https://gist.github.com/dLoewy/c9f695d67f77ec18a7e60a25c4e62b01

TLDR key points:

  1. Using member groups, Region A is on Server 1 and Region B is on Server 2.
  2. These regions must be PARTITIONED in Production.
  3. I need to run a data-dependent function on one of these regions; the client code chooses which.
  4. As-is, my client code always fails to find one of the regions.

Can someone please help me get on track? Is there an entirely different cluster architecture I should be considering? Happy to provide more detail upon request.

Thanks so much for your time!

David

FYI, the following docs pages mention function execution on Member Groups, but give very little detail. The first link describes running data-independent functions on member groups, but doesn't say how, and doesn't say anything about running data-dependent functions on member groups. https://gemfire.docs.pivotal.io/99/geode/developing/function_exec/how_function_execution_works.html https://gemfire.docs.pivotal.io/99/geode/developing/function_exec/function_execution.html

David Loewy

2 Answers

Have you tried creating two different pools on the client, each one targeting a specific server group, and executing the function as usual with onRegion? I believe that should do the trick. For further details, please have a look at Organizing Servers Into Logical Member Groups.

Hope this helps. Cheers.

Juan Ramos
  • I'll look into that, thanks! Is there a way to do this multi-pool configuration purely with the Java client API, i.e. without a `cache.xml`? PS, thanks Juan for answering this and my previous question too! – David Loewy Jan 14 '20 at 17:49
  • Hey David, it's easier using the `cache.xml` but you can also achieve the same goal through the Java API. You basically need to use `ClientRegionFactory.setPoolName` when creating the client region, and make sure the pool is already defined within the client cache. – Juan Ramos Jan 14 '20 at 18:04
  • As an example: `clientcache.createClientRegionFactory(ClientRegionShortcut.PROXY).setPoolName("yourPool").create("tag_sketch_high_k")`, and so on. – Juan Ramos Jan 14 '20 at 18:05
  • To create a pool targeting a specific server group: `Pool pool1 = PoolManager.createFactory().addLocator(locatorHost, locatorPort).setServerGroup("groupName").create("poolName");` – Juan Ramos Jan 14 '20 at 18:07
  • This was exactly what I needed. Both functions are successfully executing sequentially. Thank you! – David Loewy Jan 14 '20 at 20:58
  • No worries, glad to help! – Juan Ramos Jan 15 '20 at 00:13
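
For readers who want the pieces from the comments above in one place, here is a minimal sketch of the two-pool client setup using only the Java API. It is not the asker's actual code. The group names (`high-precision`, `low-precision`), pool names, the low-precision region name `tag_sketch_low_k`, the function id `analyticsFunction`, the key set, and the locator address are all hypothetical placeholders. It assumes the Geode client library is on the classpath, that a cluster like the one described in the question is running, and that the function is already registered on the servers.

```java
import java.util.Set;

import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;
import org.apache.geode.cache.client.PoolManager;
import org.apache.geode.cache.execute.FunctionService;
import org.apache.geode.cache.execute.ResultCollector;

public class TwoPoolClient {

  public static void main(String[] args) {
    // Hypothetical locator address; adjust to your cluster.
    String locatorHost = "localhost";
    int locatorPort = 10334;

    // Create the client cache first. This also creates a default pool,
    // which the regions below do not use.
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator(locatorHost, locatorPort)
        .create();

    // One pool per member group (group and pool names are assumptions).
    PoolManager.createFactory()
        .addLocator(locatorHost, locatorPort)
        .setServerGroup("high-precision")
        .create("highPrecisionPool");
    PoolManager.createFactory()
        .addLocator(locatorHost, locatorPort)
        .setServerGroup("low-precision")
        .create("lowPrecisionPool");

    // Bind each client proxy region to the pool for its group.
    Region<String, Object> highPrecisionRegion = cache
        .<String, Object>createClientRegionFactory(ClientRegionShortcut.PROXY)
        .setPoolName("highPrecisionPool")
        .create("tag_sketch_high_k");
    Region<String, Object> lowPrecisionRegion = cache
        .<String, Object>createClientRegionFactory(ClientRegionShortcut.PROXY)
        .setPoolName("lowPrecisionPool")
        .create("tag_sketch_low_k"); // hypothetical region name

    // Hypothetical filter keys and function argument.
    Set<String> keySet = Set.of("key-1", "key-2");
    Object inputObject = "example-argument";

    // Data-dependent execution, routed through each region's own pool.
    ResultCollector<?, ?> highResult = FunctionService.onRegion(highPrecisionRegion)
        .setArguments(inputObject)
        .withFilter(keySet)
        .execute("analyticsFunction"); // hypothetical function id
    System.out.println("High precision: " + highResult.getResult());

    ResultCollector<?, ?> lowResult = FunctionService.onRegion(lowPrecisionRegion)
        .setArguments(inputObject)
        .withFilter(keySet)
        .execute("analyticsFunction");
    System.out.println("Low precision: " + lowResult.getResult());

    cache.close();
  }
}
```

The point of the sketch is that each `onRegion` call goes through the pool its region was created with, so each function should only be sent to servers in the matching member group.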

Because the region data is not replicated across servers, it looks like you need to use the onMembers or onServers methods in addition to onRegion.

rupweb
  • I would expect onRegion() to intelligently route the function execution to servers that host the region, but that doesn't appear to be the case. Furthermore, my use case is data-dependent, which the docs say can only be handled by onRegion(). How do I combine this with onServers()? – David Loewy Jan 14 '20 at 15:35
  • Yup, I don’t know! It makes sense that the data partitions aren’t replicated though. Except yes I did expect a logical view of all data partitions from the locator. I read that it depends on a PartitionResolver for your buckets then came to the onMember and onServer which I guess guide the locator / manage the buckets.... – rupweb Jan 14 '20 at 15:39