Model source information to maximize query performance

Question

I am wondering about the best way (in terms of performance) to model data sources in Neo4j.

Consider the following scenario: we are joining different datasets from the music domain into one graph. The data can range from artists and styles to sales information. It is important to store the source of this information, e.g. whether the data comes from a public source like DBpedia or from a private source. To be able to run queries on only certain datasets, we have to attach the source to each Node (and ideally to each Relation). Of course, one Node or Relation could have multiple sources.

There are three straightforward solutions:

1. Add a source property to each Node and Relation, index this property, and use it in a Cypher query. E.g.:

MATCH (n:Artist) WHERE n.source = 'DBpedia' RETURN n

2. Add the source as a Label to each Node and as a Type to each Relation (can we have multiple types on one Relation?). E.g.:

CREATE (n:Artist:DBpediaSource:CustomerSource)

3. Create a separate Node for each Source and link all other Nodes to the corresponding Source Node. E.g.:

MATCH (n:Artist)-[:HASSOURCE]-(:DBpediaSource) RETURN n
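
To make the "multiple sources" requirement concrete, here is a minimal sketch of how the property and label variants could look when one Node belongs to several sources. The list property `sources`, the artist name, and the value 'Customer' are hypothetical names chosen for illustration. I have not measured any of this, and I am not sure an index on a list property helps with the `IN` membership check.

```cypher
// Assumption: `sources` is a hypothetical list property holding every source of a node.
CREATE (n:Artist {name: 'Example Artist', sources: ['DBpedia', 'Customer']});

// Property variant: keep only artists that list DBpedia among their sources.
MATCH (n:Artist)
WHERE 'DBpedia' IN n.sources
RETURN n;

// Label variant: a node can carry several source labels at once,
// so filtering by source is just an additional label in the pattern.
MATCH (n:Artist:DBpediaSource)
RETURN n;
```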

Of course, for these simple examples the choice of model hardly matters in terms of performance. However, when the source is used in more complex queries on a bigger graph (let's say a few million Nodes and Relations), the way we model this will have a significant influence on performance.

A more complex example where the sources are also needed is the generation of a "sub graph". We want to extract all Nodes and Relations from one or multiple Sources and, for example, export them to a new Neo4j instance, or restrict graph algorithms such as PageRank to this "sub graph" without creating a separate Neo4j instance.

Does anyone in the community have experience with such a case? What is the best way to model this in terms of performance? Are there maybe other solutions?
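
As a rough sketch of the kind of "sub graph" query I have in mind (the label `DBpediaSource` and the relationship property `sources` are assumptions taken from the modeling options above, not an established convention):

```cypher
// Label variant: both end nodes must carry the (hypothetical) DBpediaSource label.
MATCH (a:DBpediaSource)-[r]->(b:DBpediaSource)
RETURN a, r, b;

// Property variant for relationships, assuming a hypothetical list property
// `sources` on every relationship.
MATCH (a)-[r]->(b)
WHERE 'DBpedia' IN r.sources
RETURN a, r, b;
```

Note that both queries only return Nodes that take part in at least one relationship; isolated Nodes of a source would need a separate MATCH. As far as I understand, a relationship in Neo4j has exactly one type, which is why the second query falls back to a property for the source of a Relation.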

Thanks for your help.

This is hard to answer because it really depends on how you are currently modeling the data and what you are trying to model. There are a slew of gotchas here, from sources that have conflicting information/localization to how the data is acquired and used. (Improved read speed generally costs write speed.) A model that has great performance for one use case could be terrible for another. In general, focus on a model that is easy for people to understand, and if you need more performance, hire or become a DB admin for DB tuning. As is, we need to know more about the data you are storing. — Tezra, Oct 09 '18 at 16:49
Thanks for taking the time to answer this question. Did I understand you correctly that there is no "nice/approved" way to model such a use case? Even when restricted to high-performance read requests? — blackmamba, Oct 10 '18 at 14:38
@blackmamba There is, but it varies depending on what your data is and how you use it. For example, if you are frequently querying the same information that rarely changes, you will want to set up a cache store between your app and Neo4j that can cache search results to be reused later. If the result changes nearly every time you run the query, though, the cache store is a huge waste of time. — Tezra, Oct 11 '18 at 14:32

