Spring Data GemFire custom partition and performance

Question

We are using Spring Data GemFire server, client and locator. All of our GemFire PARTITION Regions have complex keys.

For example:

class Key { 
  String id1;
  String id2;
  Date date;
}

We would like to create a custom partition based on this entire key. In the getObject() method we are planning to return a | delimited string of these 3 fields.

Is this a best practice, or is there another way to return the object?

We are also planning to create key indexes. In this case we would have to create indexes individually on Key.id1, Key.id2 and Key.date, as our searches will be based on the key date and the key id1 and id2.

Is this a right way to create the key index for improving the performance?

Based on the GemFire documentation, we are planning to use Functions to improve performance, passing a Filter argument so that the search happens in a specific partition.

Do we just need to pass the complex key object in the filter set, or the value produced by the partition logic we added in getObject?

John Blum · Answer 1 · 2021-04-01T18:34:52.577


First of all, this problem is independent of whether you started your GemFire (data) servers using Spring Data GemFire (SDG) or not, such as by using Gfsh. Having said that, there are significant advantages to using Spring, and specifically SDG, to bootstrap and configure your servers, Locators, and clients. But, I simply wanted to make this distinction where this problem is concerned for other interested readers.

By getObject() method, I assume you are actually referring to PartitionResolver.getRoutingObject()? See Javadoc.

In general, I'd say it is nearly always preferable to use simple, scalar types as keys in your Regions, such as Long, Integer, String, etc. Most searching should be based on the value, or properties of the value (i.e. Object) rather than individual components (e.g. id1) of the key.

Additionally, I will also point out that I disagree with the PartitionResolver Javadoc, bullet #1, where it states, "The key class can implement the PartitionResolver interface to enable custom partitioning". I think this is a naive approach for many reasons, not the least of which is it couples your key class to GemFire. You should always prefer #2 when a PartitionResolver is needed.

But is a PartitionResolver actually needed in your case?

Since your "entire" key defines the "route" (i.e. all properties [ id1, id2, date ] of the Key class), you don't even really need to involve a custom PartitionResolver at all.

All you need to do is provide a proper implementation of the equals(:Object) and hashCode() methods in your Key class.
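As a minimal sketch of what such a key could look like, assuming the three fields from the question: the constructor, the getter, and the choice to implement java.io.Serializable are illustrative additions, not something the question specifies. Both methods use all three fields, so two keys are equal only when the entire key matches.

```java
import java.io.Serializable;
import java.util.Date;
import java.util.Objects;

// Hypothetical version of the Key class from the question.
final class Key implements Serializable {

  private final String id1;
  private final String id2;
  private final Date date;

  Key(String id1, String id2, Date date) {
    this.id1 = Objects.requireNonNull(id1);
    this.id2 = Objects.requireNonNull(id2);
    this.date = new Date(date.getTime()); // defensive copy, Date is mutable
  }

  Date getDate() {
    return new Date(date.getTime());
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof Key)) {
      return false;
    }
    Key that = (Key) obj;
    return id1.equals(that.id1) && id2.equals(that.id2) && date.equals(that.date);
  }

  @Override
  public int hashCode() {
    // Built only from String and Date hash codes, whose algorithms are
    // specified by the JDK, so the value does not vary between JVMs.
    return Objects.hash(id1, id2, date);
  }
}
```

A hash code that relied on identity (the default Object.hashCode()) would differ from one process to the next, which is presumably a problem when several processes need to agree on where a key lives.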

TIP: Keep in mind that GemFire Regions at a basic, fundamental level, are simply a java.util.Map, key-value data structure. Yes, they are distributed (in most cases) as well as partitioned for the PARTITION Regions, but it is fundamentally based on a Map and the "hash" of your key. If your entire key defines the partition (or route), then no custom PartitionResolver is necessary.

TIP: Furthermore, a PARTITION Region is a logical Region that is divided up into 113 buckets (by default, ignoring primaries & secondaries for a moment) and those buckets are distributed across the (data-hosting) servers in your cluster, making the Region physically dispersed, of course, assuming your servers are individual processes on separate machines. This is what constitutes a "logical" Region, because to your application, it is simply 1 holistic data structure. Anyway.
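To make the hash-to-bucket idea concrete, here is a deliberately simplified model in plain Java. The class and method names are hypothetical, and the modulo formula is an assumption for illustration only; GemFire's internal bucket computation may differ in its details.

```java
// Simplified, hypothetical model of hash-based bucket selection.
final class BucketModel {

  static final int TOTAL_BUCKETS = 113; // the default bucket count mentioned above

  // Maps a routing object (by default, the key itself) to a bucket id
  // in the range [0, TOTAL_BUCKETS).
  static int bucketFor(Object routingObject) {
    return Math.floorMod(routingObject.hashCode(), TOTAL_BUCKETS);
  }
}
```

Equal routing objects always produce the same bucket id in this model, which is why a correct equals/hashCode pair on the key is all that is needed when the whole key defines the route.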

You would implement a custom PartitionResolver if a portion of the key was used to determine the partition (or route) of the key/value pairing. This is useful if you want to group certain key/value pairings together, at the same physical location (i.e. server/process & machine in the cluster).

For example, suppose you want to group similar key/value pairings based on the date of your key. Then...

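import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionResolver;

// Routes each entry by the date component of its key.
class KeyDatePartitionResolver implements PartitionResolver<Key, Object> {

  @Override
  public String getName() {
    return getClass().getName();
  }

  @Override
  public Object getRoutingObject(EntryOperation<Key, Object> entryOp) {
    Key key = entryOp.getKey();
    return key.getDate();
  }

  @Override
  public void close() { } // no resources to release
}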

Now all entries (key/values) whose keys carry the same date/time value would be routed to the same partition (or bucket) in the logical PARTITION Region. Of course, you could further reduce the precision of the date to group, or route, the key/value pairings based on year/month/day or simply year/month, however you choose. Again, all that matters is that the Object returned from the getRoutingObject(..) method in your custom PartitionResolver implements the equals(:Object) and hashCode() methods. Obviously, Java's java.util.Date class (Javadoc) does.
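Grouping by year/month could be sketched as follows. This is a plain-Java helper with hypothetical names, and it assumes UTC as the time zone for truncation; its result is the kind of value a getRoutingObject(..) implementation might return instead of the full Date. java.time.YearMonth implements equals, hashCode and Serializable.

```java
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.Date;

// Hypothetical helper that reduces a Date to year/month precision.
final class DateRouting {

  // All dates within the same UTC calendar month yield equal YearMonth values.
  static YearMonth toYearMonth(Date date) {
    return YearMonth.from(date.toInstant().atZone(ZoneOffset.UTC));
  }
}
```

With a routing object like this, every entry from the same month shares a route, whereas routing on the raw Date only groups entries whose timestamps are identical to the millisecond.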

Regarding...

"Is this a right way to create the key index for improving the performance?"

Well, it depends on your application search cases. Are your search cases for certain values based on the components (i.e. [ id1, id2, date ]) of the key collectively or individually?

For example, if you search by the combinations [ id1, date ] as well as [ id2, date ] then you would create 2 (KEY) Indexes with these fields from the Key class. If you searched by all 3 fields [ id1, id2, date ], then your (KEY) Index would include all 3 fields. If you searched by all 3 combinations, then you would (generally) need all 3 KEY Indexes for optimal performance.

Essentially, a field or combination of fields used in a query predicate expression should be indexed for potentially more optimal performance.

There is no guarantee though, either. Remember, when values change (are added, updated, removed, etc) Indexes need to be updated to some degree. Therefore, there are "maintenance costs" associated with Indexes and the more you have, the more it can potentially cost.

You also have to weigh the benefit against the number of key/value pairings and whether an Index is warranted at all. If the data is mostly referential in nature, with a relatively small data set (e.g. < 1000 entries, perhaps), then sometimes a full scan can still perform better than using an Index. A full scan is equivalent to a full table scan in an RDBMS. Just remember, Indexes are not free. They take up space (memory) and time (CPU) to maintain.

I'd also say, it is generally better to (again) use simple keys and maintain "searchable" state in the values associated with the keys. This boils down to design preference, though. Use (simple) keys for partitioning/routing.

For additional (and relevant) information, see: here, here, here, and here.

Lastly, regarding Functions, the filter is a set of "keys" (Javadoc). The keys are used to find, or route to the (bucket of the) partition in the logical, PARTITION Region.

If you also configured a custom PartitionResolver with the PARTITION Region, I believe it will also apply the resolver to the filter (i.e. the set of keys) passed to the Function when the Function is executed.

But, you are simply passing the entire key, which in your case is an instance of your Key class, where you can pass multiple instances (hence, the "Set") depending on which keys you want to filter by.
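As a rough illustration of why the filter holds whole keys: each key in the set resolves to a bucket, and the execution only needs to visit the buckets that the filter touches. The sketch below models that grouping in plain Java with hypothetical names and an assumed modulo formula; it is not how GemFire implements Function routing internally.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model: group the keys of a Function filter by bucket id.
final class FilterRouting {

  static final int TOTAL_BUCKETS = 113; // default bucket count

  static Map<Integer, Set<Object>> groupByBucket(Set<?> filterKeys) {
    Map<Integer, Set<Object>> keysByBucket = new TreeMap<>();
    for (Object key : filterKeys) {
      // Assumed formula; the real computation may differ.
      int bucketId = Math.floorMod(key.hashCode(), TOTAL_BUCKETS);
      keysByBucket.computeIfAbsent(bucketId, id -> new HashSet<>()).add(key);
    }
    return keysByBucket;
  }
}
```

In this model, a filter of three keys that land in two buckets would involve only those two buckets rather than all 113.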

Anyway, I hope this all makes sense.

As always, when these sorts of questions are asked, the answer varies significantly based on your UC (or data access patterns), requirements, and data set. The proper thing to do here is try things and test.

Good luck!

edited Apr 01 '21 at 18:34

answered Feb 12 '21 at 23:57

John Blum


Thanks, John. We have a requirement where the searches from our application will be partial key searches (e.g. if there are 4 fields in a key, the search would be on only 3 of them). In this case, should we still use a complex object as the key, or does it make sense to use a String as the key? – Vaidy Apr 01 '21 at 04:36
I have seen it done both ways, but typically, users use Strings, where the String contains 3 out of the 4 values in a predefined order to make the routing consistent, since again, routing is based on the `hashCode` of the routing object. Personally, I like a complex type as it is easier to understand (e.g. self-describing), document and maintain. – John Blum Apr 01 '21 at 18:38
With a complex type, the issue is that we need to make sure it is not being serialized. I have been trying to use a MappingPdxSerializer with PDX read-serialized set to true, and I just want to exclude the complex type from being serialized, but it still gets serialized. I tried using the transient keyword for the key, but then when I do a select key from /region I don't even see that value. What is the best approach to make sure the complex type key doesn't get serialized? I also tried MappingPdxSerializer's excludeTypes to ignore the field, but it still doesn't work. – Vaidy Apr 01 '21 at 19:06
Not sure I am exactly following you... are you saying you have a Region "value" such as `class ComplexDomainType { @Id ComplexKeyType key }` where an instance of `ComplexKeyType` is the Region "key"? I guess either way it does not matter, because the key (or `ComplexKeyType`) will be used to store the instance of `ComplexDomainType` in the Region to which the `ComplexDomainType` is mapped. So, the key and value must always be serializable, particularly between client and server. There are no exceptions in this case. – John Blum Apr 01 '21 at 19:56
I guess, using your earlier example above, in your original question, your situation might possibly be `class ComplexDomainType { @Id Key key; }` where the Region would be defined as `Region<Key, ComplexDomainType>`. Both the `Key` class and `ComplexDomainType` must be serializable (e.g. Java Serializable (which a String-based key would be), PDX serializable or `DataSerializable`, alternatively using a `DataSerializer`). Both keys and values stored in the Region must be serializable, even if not using a client/server topology... – John Blum Apr 01 '21 at 20:00
Region Keys/Values will be serialized between peers in a cluster as well as possibly over WAN if you are using GemFire/Geode's WAN topology. If data is overflowed/persisted to disk, the data must be serializable. Therefore, keys/values almost nearly always need to be serializable in some capacity. – John Blum Apr 01 '21 at 20:00
In general, if you don't want some property of the Region's "value" (e.g. `ComplexDomainType`) to be serialized, then you can use Spring Data's `@Transient` annotation on the object field or property. But, you cannot say an entire key or value cannot be serialized. That won't work, but I suspect after reading your comment a few times, that is not what you are implying, either. Hence, my confusion; sorry. – John Blum Apr 01 '21 at 20:03
CLARIFICATION on the comment beginning "_Region Keys/Values will be serialized..._": Region Keys/Values will be serialized between peers in a cluster as well as possibly over WAN if you are using GemFire/Geode's WAN topology. If data is overflowed/persisted to disk, the data must be serializable. Therefore, keys/values almost always need to be serializable in some capacity. LOCAL Regions without overflow/persistence may be the only exception, and even then, I am not entirely certain. – John Blum Apr 01 '21 at 20:06
On the GemFire website, it is mentioned that using PDX objects as region entry keys is highly discouraged, it's in your best interest to entirely avoid that approach, trust us. That said, if you still want to go down this path, you can use the MappingPdxSerializer.excludeTypeFilters [1] to configure classes that shouldn't be handled by the MappingPdxSerializer. I'd suggest going through Spring Data GemFire's examples – Vaidy Apr 01 '21 at 20:34
This is the link: https://gemfire.docs.pivotal.io/99/geode/developing/data_serialization/using_pdx_region_entry_keys.html. It says, "The best practice for creating region entry keys is to use a simple key; for example, use a String or Integer. If the key must be a domain class, then you should use a non-PDX-serialized class." Going by what is mentioned there, I thought that if I use a domain object as a key in my class, it shouldn't be serialized. Is my understanding correct? – Vaidy Apr 01 '21 at 20:59
No. The only time a value (field/property) of a domain object is not serialized, whether a primitive/wrapper type or a complex type, is when the field or property is marked with the SD `@Transient` annotation. Keys are always serialized since the key is used to map to the value (e.g. domain object in the Region). – John Blum Apr 02 '21 at 19:02
By way of example... if you have a domain class like `@Region("People") class Person { @Id Long id }` (along with other fields/properties), then if you store a person instance in the "People" Region, for example by using `PersonRepository.save(:Person)`, then this ultimately results in a call to `PeopleRegion.put(id, person)` where `id` is the "id" field of the `Person` class. So, it really does not matter if you are using PDX serialization or even mark the `id` field/property as transient, it will/must get serialized. Does this make sense? – John Blum Apr 02 '21 at 19:05
Yes, got it. I have another question: if I use MappingPdxSerializer and a domain object as a key, do I need to use excludeTypeFilters to exclude the domain objects? I see exceptions like the one below: An IOException was thrown while deserializing; nested exception is org.apache.geode.cache.client.ServerOperationException: – Vaidy Apr 05 '21 at 15:23
Also, on a parallel note, I wanted to check whether a partial-key-based search will be faster than a value-based search; that is one of the main reasons for adding the domain object as the key. For example, I have a Region class like class Region { Key key; String id; String value; } and the Key class has 2 variables, id and value. Will key.id= and key.value= in searches be faster than id= and value=? Will it make any difference? – Vaidy Apr 05 '21 at 15:58
Regarding your first comment... Well, it seems the client serialized the domain object (value) successfully, but the server-side (hence, `ServerOperationException`) could not deserialize (thus, `IOException was thrown while deserializing`) the value. The server-side should not normally cause a deserialization unless an OQL query or Function execution accesses the stored value in some non-arbitrary way. For example, non-field based derived values would cause a deserialization. Think of an `age` property on a `Person` class that is derived from the `birthdate` field. ... – John Blum Apr 05 '21 at 17:57
If `age` is used in a query (e.g. `SELECT * FROM /People WHERE age >= 21`) then that will cause the value to be deserialized because there is no `age` field on the `Person` class (e.g. `age` is derived from `birthdate`). The OQL query engine must access the `age` property by invoking the `getAge()` method, which causes a deserialization. If the `Person` class does not have a public, no-arg constructor then GemFire/Geode has problems deserializing those PDX values back to object form, unless you configured SDG's `MappingPdxSerializer` on the server-side. This is uncommon ... – John Blum Apr 05 '21 at 18:01
... if you did not bootstrap and configure the server-side cluster with Spring. In fact, GemFire/Geode PDX serialization has a multitude of problems when trying to handle complex use cases, with complex domain objects. Its limitations become apparent quite quickly. SDG's `MappingPdxSerializer` alleviates some, but not all of these problems. Honestly, it is hard to say what your exact problem is in this case since you did not provide the entire Stack Trace; I'd look at the last 2 `caused by`. Feel free to share. – John Blum Apr 05 '21 at 18:07
Regarding your second comment... It depends. It depends on many factors such as, but not limited to, how many nodes are in the cluster, whether you configured redundancy, how many values are stored in the Region (PR I assume), and whether you configured single-hop. I mean, a partial key and a properly crafted and efficient `PartitionResolver` only leads the `get(key)` request to the proper server. However, you could probably achieve a similar effect using a Function with a (key) Filter to query the PR, provided the proper Indexes matching the query predicates were created. – John Blum Apr 05 '21 at 18:12
There are all sorts of factors to consider. You also have to consider the maintenance cost of each approach. I'd opt for the simplest approach, that is easy to maintain that also gives acceptable performance and adjust as needed, e.g. as demand grows. – John Blum Apr 05 '21 at 18:13
I will get the stack trace. I have a class with a domain object as the @Id: class region { @Id Key key }. I'm using MappingPdxSerializer to serialize. All I'm doing is creating a new instance of MappingPdxSerializer and passing it to the serializer method of the cacheFactory. Do I need to include or exclude filter types by passing the key domain class? In the client I'm using EnablePDX. This setup works up to the lower environments but fails in production, where we have more servers. – Vaidy Apr 05 '21 at 19:04
This is failing in all the getByKey methods and also in put methods on the server. It does not fail every time; it fails intermittently. – Vaidy Apr 05 '21 at 19:05
Do I need to make sure the domain key class and any other Region-annotated classes implement Serializable? Currently they do not. – Vaidy Apr 05 '21 at 19:52
Yes. Your key class must be serializable in some fashion. Of course, GemFire/Geode will attempt to serialize it as PDX if you do not exclude it from `MappingPdxSerializer`. I would add an exclude for it and make your custom `Key` class `java.io.Serializable`. Again, all of this might be simpler if you use a primitive/Wrapper type returned by your custom `PartitionResolver` if you are still going down that path. I know I said I prefer a complex, encapsulated type in most cases, but this might warrant the opposite. – John Blum Apr 06 '21 at 00:07
Just remember, anything that goes over the wire, is overflowed/persisted to disk, etc, etc, must be serializable in some form (i.e. Java Serialized bytes, GemFire/Geode PDX or Data Serialization bytes). This includes keys and values. I don't recall off the top of my head whether keys get passed to the configured `PdxSerializer` of the cache (e.g. SDG's `MappingPdxSerializer`). I suspect they do. In any case, the key, like the value, must be serialized when the "entry" is sent from the client to the server/cluster. – John Blum Apr 06 '21 at 00:08
In my case, what is happening is that both the keys and the values are getting serialized. I haven't excluded anything from the MappingPdxSerializer. Do I need to exclude the key class from the MappingPdxSerializer and make it explicitly implement the Serializable interface? – Vaidy Apr 06 '21 at 00:26
Also, the same Region contains a list of domain objects as an attribute. This also gets serialized. Do I need to exclude them from MappingPdxSerializer as well? – Vaidy Apr 06 '21 at 00:28
Given your key is a complex type, then I would probably exclude it from the `MappingPdxSerializer` unless you have OQL queries that query part of the key. Querying parts of a key is highly uncommon, but crazier things have happened. Yes, if you exclude the key from the `MappingPdxSerializer`, then the key must implement `java.io.Serializable`. It has to be serializable in some form. The list of domain objects as a field/property of an enclosing domain object should not be a problem. – John Blum Apr 06 '21 at 16:59
Thank you. We are following the same approach. The only question: we are querying parts of the key in our OQL queries, as it is a domain object. Will this cause performance issues? – Vaidy Apr 06 '21 at 18:31
Yes, possibly. If the key is not serialized as PDX (which is not recommended by GemFire/Geode, I think) then it will deserialize (at least the key object) every time you query it. PDX is the only format you can query without causing general deserialization, unless you access a property not derived from an object field, such as in my `Person` class `birthdate` (field-based) and `age` (derived/calculated value based on `birthdate`) example. – John Blum Apr 06 '21 at 19:13
This is a little bit confusing. We are using domain keys and values, and we are using MappingPdxSerializer, so implicitly both the keys and the values would get serialized to PDX, right? In this case, should we exclude the key object so that it does not get serialized as PDX by the MappingPdxSerializer? – Vaidy Apr 06 '21 at 19:22
Yes, both the key and value would get serialized to PDX when using SDG's `MappingPdxSerializer` if 1) the key is a complex object (which you say it is, and illustrated in your question in this SO post with the `Key` class above) and 2) the `Key` object is a key on the app domain object that is stored in a Region as a value. For example, using your `Key` class above as the "id" for the value, or app domain object, we would have `@Region("Test") class DomainObject { @Id Key identifier; ... }` which would be stored in the "Test" Region. Make sense? – John Blum Apr 07 '21 at 03:52
If you are writing OQL queries where you are querying parts of the `Key`, say `id1` and `date` or `id2` and `date` or any of the `Key` fields individually (e.g. just `date`) then it would be better if `Key` object on the `DomainObject` instance (value) stored in the Region were PDX, to prevent a deserialization. However, if you are just looking up the value using `Region.get(key)` then it makes little difference, since it will use the `Key` instance to compute a hashCode to lookup the value (i.e. `DomainObject`) in the "Test" Region. – John Blum Apr 07 '21 at 03:55
So, it depends on how you are using the keys and accessing the keys/values in the Region. Ideally, the "key" is a primitive type (e.g. `String`) and all access (e.g. queries, or even simple gets/puts) are on the Region value, which is stored as PDX. This is the most optimal way to use GemFire/Geode. All information that is queryable/accessible is in the value, not the key. The key is just the identifier used to (quickly) lookup (i.e. get) or route values on put when stored. – John Blum Apr 07 '21 at 03:59
Thanks for the clarification. I will try to use a primitive type for keys. In our case, most of the OQL will be based on values, there will be millions of records stored, and our Regions are partitioned. In most cases we would not know the key completely, so the filter would be only on the values and not on the keys. In such cases, would it be ideal to use a key-based search or a value-based search with indexes for the best performance from GemFire? – Vaidy Apr 07 '21 at 13:30
Another question: if I go with the above case, where I have a String as the key, and I'm not going to use Functions to query, does custom partitioning help? For example, if I have a pipe-delimited String key with id and date and a custom partition by date only, and I don't use Functions, will it really help performance, given that most of my OQL queries would be on the values and not on the key? – Vaidy Apr 07 '21 at 18:03
Not really, no. When querying a PR, you want to do most of the querying/work inside of a Function and operate on local data set (bucket) of the (logical) PR. See the docs (https://geode.apache.org/docs/guide/113/developing/querying_basics/querying_partitioned_regions.html) for more details. – John Blum Apr 07 '21 at 19:31
Should we always use a Function to query a partitioned Region if we are searching by values in the OQL and not by keys? – Vaidy Apr 07 '21 at 21:26
Most of the time, yes. – John Blum Apr 07 '21 at 23:46
I have a Region whose value has a list of domain objects, and those domain objects also have a list; in addition, the value has other attributes. I want to update a single attribute in the Region. As the list of domain objects is huge, when I do a gemfirerepository.get(string) or a gemfirerepository.save() it hangs, because it is trying to serialize the list of objects. Is there a better way to update just one column in GemFire, or if not, what is the best way to handle this situation? – Vaidy Apr 13 '21 at 15:47
Well, you could break up your domain model a bit and store the List of objects independently in a separate Region (that you could even colocate for querying purposes). You might be able to handle this UC via GemFire/Geode Data Serialization and Deltas, though I would need to think on that one more, and probably write some tests, which you could, of course, do yourself. There might also be other ways, but these were the primary ways I could think of, off the top of my head. – John Blum Apr 13 '21 at 23:11
Suppose I have a Region Employee and another Region Salaries, and I colocate Salaries with the Employee Region. How would I query both Regions if both are partitioned Regions? If I query on Salaries, will I be able to get all the fields from the Employee Region, and vice versa? – Vaidy Apr 14 '21 at 20:00
https://geode.apache.org/docs/guide/113/developing/partitioned_regions/join_query_partitioned_regions.html – John Blum Apr 14 '21 at 20:05
I tried with the Spring Data GemFire Function annotations, but I get an error stating that join queries on a partitioned Region can run only in Functions. Here is the complete snippet: https://community.pivotal.io/s/question/0D54y00006E78LXCAZ/spring-data-gemfire-partition-region-equi-join-queries – Vaidy Apr 16 '21 at 13:47
Also, in the ClientCacheApplication, if I have 2 different interfaces annotated with @OnRegion, the client doesn't connect to the locator at all, but if there is only one @OnRegion it works fine. – Vaidy Apr 16 '21 at 13:49
