
I have Regions in GemFire with a large number of records.

I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan; there can be more than 10,000 items.

What would be an efficient way to look up elements in these Regions?

Please suggest.

Vikas Rana

2 Answers


Vikas-

There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.

  1. A GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several other operations that are not available in Map, like getAll(Collection keys):Map.

However, get(key):value is not going to be the most "efficient" method for looking up multiple items at once; getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance, so...
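As a minimal sketch of the bulk-lookup approach, assuming a Region<Long, Person> named "/People" already exists in a running GemFire/Geode cache (the Person type and key names are illustrative):

```java
import java.util.Collection;
import java.util.Map;

import org.apache.geode.cache.Region;

public class BulkLookup {

    public static Map<Long, Person> lookup(Region<Long, Person> people,
            Collection<Long> ids) {

        // getAll(..) fetches all requested entries in a single operation,
        // avoiding one network round trip per key.
        return people.getAll(ids);
    }
}
```

Keys absent from the Region appear in the returned Map with a null value, so the caller can distinguish hits from misses in one pass.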

  2. You can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (or Object Query Language). See GemFire's User Guide on Querying for more details.

The advantage of using OQL over getAll(keys) is, of course, you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that matches the values that need to be evaluated, you can express this criteria in the OQL Query Predicate.

For example...

SELECT * FROM /People p WHERE p.age >= 21;

To call upon the GemFire QueryService to write the query above, you would...

Region<Long, Person> people = cache.getRegion("/People");

...

QueryService queryService = people.getRegionService().getQueryService();

Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");

SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });

// process (e.g. validate) the results
OQL Queries can be parameterized and arguments passed to the Query.execute(args:Object[]) method as shown above. When appropriate Indexes are added to your GemFire Regions, the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
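As a sketch, an Index supporting the age predicate above could be created through the same QueryService (the index name "ageIdx" is illustrative):

```java
import org.apache.geode.cache.query.Index;
import org.apache.geode.cache.query.QueryService;

public class IndexSetup {

    public static Index createAgeIndex(QueryService queryService) throws Exception {
        // Arguments: index name, indexed expression, from clause.
        // Lets "SELECT * FROM /People p WHERE p.age >= $1" avoid a full scan.
        return queryService.createIndex("ageIdx", "p.age", "/People p");
    }
}
```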

  3. Finally, with GemFire PARTITION Regions especially, where your Region data is partitioned, or "sharded", and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People), you can combine querying with GemFire's Function Execution service to query the data locally (to that node), where the data actually exists (e.g. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.

You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
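As a sketch of that pattern, assuming the newer Geode generic Function API (the Function id, age threshold, and validation step are illustrative):

```java
import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.partition.PartitionRegionHelper;
import org.apache.geode.cache.query.QueryService;
import org.apache.geode.cache.query.SelectResults;

public class ValidatePeopleFunction implements Function<Object> {

    @Override
    public String getId() {
        return "validatePeople";
    }

    @Override
    @SuppressWarnings("unchecked")
    public void execute(FunctionContext<Object> context) {

        RegionFunctionContext regionContext = (RegionFunctionContext) context;

        // Restrict the data set to the buckets hosted on this member.
        Region<Long, Person> localData =
            PartitionRegionHelper.getLocalDataForContext(regionContext);

        try {
            QueryService queryService =
                localData.getRegionService().getQueryService();

            // Passing the RegionFunctionContext scopes the query to local data.
            SelectResults<Person> results = (SelectResults<Person>)
                queryService.newQuery(
                        "SELECT * FROM " + localData.getFullPath() + " p WHERE p.age >= $1")
                    .execute(regionContext, new Object[] { 21 });

            // validate the local results here...
            context.getResultSender().lastResult(results.size());
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

A client would then invoke it with FunctionService.onRegion(people).execute("validatePeople") and collect the per-member results.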

Spring Data GemFire can help with many of these concerns...

  1. For Querying, you can use the SD Repository abstraction and extension provided in SDG.

  2. For Function Execution, you can use SD GemFire's Function Execution annotation support.

Be careful though, using the SD Repository abstraction inside a Function context is not just going to limit the query to the "local" data set of the PARTITION Region. SD Repos always work on the entire data set of the "logical" Region, where the data is necessarily distributed across the nodes in a cluster in a partitioned (sharded) setup.
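As a sketch of the SD Repository abstraction, assuming Spring Data GemFire is on the classpath (the entity, field, and method names are illustrative; SDG derives the OQL from the method name):

```java
import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.gemfire.mapping.annotation.Region;
import org.springframework.data.repository.CrudRepository;

// Maps this application class to the /People Region.
@Region("People")
class Person {

    @Id
    Long id;

    String name;

    int age;
}

// SDG generates the equivalent of
// "SELECT * FROM /People p WHERE p.age >= $1" from the method name.
interface PersonRepository extends CrudRepository<Person, Long> {

    List<Person> findByAgeGreaterThanEqual(int age);
}
```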

You should definitely familiarize yourself with GemFire Partitioned Regions.

In summary...

The approach you choose above really depends on several factors, such as, but not limited to:

  1. How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).

  2. How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.

  3. How many nodes are in the cluster and how distributed your data is, in which case a Function might be the most advantageous approach... i.e. bring the logic to your data rather than the data to your logic. The latter involves selecting the matching data on the nodes where it resides (which could require several network hops depending on your topology and configuration, i.e. "single-hop access", etc.), serializing the data to send over the wire, thereby increasing the saturation of your network, and so on.

  4. Depending on your UC, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.

Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for validating as early as possible.

There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach has the desired effect.

Hope this helps!

Regards, -John

John Blum

Set up the PDX serializer and use the Query Service to get your element, e.g. "SELECT element FROM /region WHERE id = xxx". With PDX, this returns the element field without deserializing the whole record. Make sure that id is indexed.
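As a minimal sketch of the PDX setup this answer assumes (the class-name pattern is illustrative):

```java
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.pdx.ReflectionBasedAutoSerializer;

public class PdxSetup {

    public static void main(String[] args) {
        new CacheFactory()
            // Auto-serialize application classes with PDX.
            .setPdxSerializer(new ReflectionBasedAutoSerializer(".*"))
            // Keep query results in serialized (PdxInstance) form so field
            // access does not deserialize the whole value.
            .setPdxReadSerialized(true)
            .create();
    }
}
```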

There are other ways to validate quickly if your inbound data is streaming rather than a client lookup, such as the Function Service.

Wes Williams