Data model design guide lines with GEODE

Question

We are soon going to start something with GEODE regarding reference data. I would like to get some guide lines for the same.

As you know in financial reference data world there exists complex relationships between various reference data entities like Instrument, Account, Client etc. which might be available in database as 3NF.

If my queries are mostly read intensive which requires joins across tables (2-5 tables), what's the best way to deal with the same with in memory grid?

Case 1: Separate regions for all tables in your database and then do a similar join using OQL as you do in database?

Even if you do so, you will have to design it with solid care that related entities are always co-located within same partition.

Modeling 1-to-many and many-many relationship using object graph?

Case 2: If you know how your join queries look like, create a view model per join query having equi join characteristics.

Confusion:

(1) I have 1 join query requiring Employee,Department using emp.deptId = dept.deptId [OK fantastic 1 region with such view model exists]

(2) I have another join query requiring, Employee, Department, Salary, Address joins to address different requirement

So again I have to create a view model to address (2) which will contain similar Employee and Department data as (1). This may soon reach to memory threshold.

Changes in database can still be managed by event listeners, but what's the recommendations for that?

Thanks, Dharam

John Blum · Answer 1 · 2016-06-20T20:36:34.127

I think your general question is pretty broad and there isn't just one recommended approach to cover all UCs (primarily all your analytical views/models of your data as required by your application(s)).

Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.

In GemFire specifically, such tuning will involve different configuration such as, but not limited to: data management policies, eviction (Overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in Off-Heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.

For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.

Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.

Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.

The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is quite optimal compared with Java Serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.

Also, you might consider how expensive it is to join the data on-the-fly. Meaning, if your are able to aggregate, transform and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental basis of MapReduce).

You should know, and I am sure you are aware, GemFire is a Key-Value store, therefore mapping a complex object graph into separate Regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to eagerly vs. lazily load them is an overloaded problem, especially in a distributed, replicated data store such as GemFire where consistency and availability tradeoffs exist.

There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.

It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.

No doubt any design choice you make will undoubtedly involve a hybrid approach. Often times the path is not clear since it really "depends" (i.e. depends on the application and data access patterns, load, all that).

But one thing is for certain, make sure you have a good cursory knowledge and understanding of the underlying data store and it' data management capabilities, particularly as it pertains to consistency and availability, beginning with this.

Note, there is also a GemFire slack channel as well as a Apache DEV mailing list you can use to reach out to the GemFire experts and community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

Data model design guide lines with GEODE

1 Answers1