
I'm new to Neo4j. I'm trying to create a monopartite projection from a bipartite graph. I've only got two types of nodes:

  • Post nodes (green): These are all pieces of content, such as tweet, reddit post, news article, etc.
  • Entity nodes (brown): These are the entities associated with the content


My challenge is that I have a handful of different relationships. Some examples:

  • (e1:Entity)-[:TWEETED]->(p:Post)-[:AT_MENTIONED]->(e2:Entity)
  • (e1:Entity)-[:TWEETED]->(p1:Post)-[:QUOTE_TWEETED]->(p2:Post)<-[:TWEETED]-(e2:Entity)
  • (e1:Entity)-[:PUBLISHED]->(p:Post)-[:MENTIONS]->(e2:Entity)

What I'm trying to do is

  1. Change this to a monopartite graph projection that contains only the entities and infers a RELATED_TO edge from all relationship types, not just a single type, and
  2. Assign an edge weight based on the number of times two entities co-occur.

In other words, using the examples above:

Example 1

  • Before: (e1:Entity)-[:TWEETED]->(p:Post)-[:AT_MENTIONED]->(e2:Entity)
  • After: (e1:Entity)-[:RELATED_TO]-(e2:Entity)

Example 2

  • Before: (e1:Entity)-[:TWEETED]->(p1:Post)-[:QUOTE_TWEETED]->(p2:Post)<-[:TWEETED]-(e2:Entity)
  • After: (e1:Entity)-[:RELATED_TO]-(e2:Entity)

Example 3

  • Before: (e1:Entity)-[:PUBLISHED]->(p:Post)-[:MENTIONS]->(e2:Entity)
  • After: (e1:Entity)-[:RELATED_TO]-(e2:Entity)

I can find examples online that convert a single relationship type to a monopartite graph, but I can't get anything to work for multiple relationship types, or for paths with more than one intervening node of a different type (e.g. two Post nodes between two Entity nodes). I've done the Graph Data Science training and couldn't find exactly what I was looking for there either.

Any advice?

CowCookie

2 Answers


Does this query work for you?

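// Match any undirected path of 2 or 3 relationships between two Entity nodes
MATCH (e1:Entity)-[*2..3]-(e2:Entity)
// Keep each pair once, whichever end the path was found from
WHERE id(e1) < id(e2)
WITH e1, e2, count(*) AS strength
MERGE (e1)-[r:RELATED_TO]->(e2)
SET r.strength = strength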

Because the pattern between e1 and e2 does not specify a relationship type, relationships of any type will match. The pattern spans two to three relationships, which corresponds to one or two Post nodes between the Entity nodes.

I assume that the direction of the relationships doesn't matter, so I left off the direction on the relationship arrows. I required the node id for e1 to be less than the node id for e2 to avoid creating the RELATED_TO relationship in both directions.

If you need to look for paths longer than 3 relationships in the schema you described, you could consider using the apoc path expander to search for Entity-to-Entity paths with only Post nodes between.
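To make it concrete what the MATCH ... count(*) step above counts, here is a minimal plain-Python sketch on a toy graph. The node names, the edge list, and the helper count_paths are hypothetical and only for illustration. It assumes that, as in Cypher, a path may not reuse a relationship, and it uses string comparison in place of id(e1) < id(e2).

```python
from collections import Counter, defaultdict

# Hypothetical toy graph: (start, relationship type, end)
edges = [
    ("e1", "TWEETED", "p1"),
    ("p1", "AT_MENTIONED", "e2"),
    ("e1", "TWEETED", "p2"),
    ("p2", "QUOTE_TWEETED", "p3"),
    ("e2", "TWEETED", "p3"),
    ("e3", "PUBLISHED", "p4"),
    ("p4", "MENTIONS", "e1"),
]
entities = {"e1", "e2", "e3"}

# Undirected adjacency: node -> list of (edge index, neighbour)
adjacency = defaultdict(list)
for idx, (start, _type, end) in enumerate(edges):
    adjacency[start].append((idx, end))
    adjacency[end].append((idx, start))


def count_paths(min_hops=2, max_hops=3):
    """Count undirected paths of min_hops..max_hops edges between entity pairs."""
    strength = Counter()

    def walk(origin, node, used):
        hops = len(used)
        # origin < node keeps each pair once, like id(e1) < id(e2)
        if hops >= min_hops and node in entities and origin < node:
            strength[(origin, node)] += 1
        if hops == max_hops:
            return
        for idx, neighbour in adjacency[node]:
            if idx not in used:  # a relationship is used at most once per path
                walk(origin, neighbour, used | {idx})

    for origin in sorted(entities):
        walk(origin, origin, frozenset())
    return strength


print(dict(count_paths()))
```

On this toy data, e1 and e2 are linked by one two-hop path (through p1) and one three-hop path (through p2 and p3), so their strength is 2, while e1 and e3 share a single two-hop path through p4.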

Nathan Smith
  • For some reason it's not. I'm doing this because I want to have a projection I can call up again ... CALL gds.graph.create('entities', 'Entity', MATCH (e1:Entity)--[*2..3]--(e2:Entity) WHERE id(e1) < id(e2) WITH e1, e2, count(*) as strength MERGE (e1)-[r:RELATED_TO]->(e2) SET r.strength = strength) But the wildcard is not working. I get this error ... Invalid input '*': expected whitespace, a variable, RelationshipsPattern, an expression or ']' (line 1, column 65 (offset: 64)) "CALL gds.graph.create('entities', 'Entity', MATCH (e1:Entity)--[*2..3]--(e2:Entity)" – CowCookie Nov 26 '21 at 22:23
  • I should add that the query works when used by itself. But it appears to be doing the actions on the graph itself instead of creating a projection (although as a newbie to Neo4j, perhaps I'm mistaken). – CowCookie Nov 26 '21 at 22:42
  • Never mind, I got it. I was having some memory problems and couldn't get it to work even with the apoc.periodic.iterate. I wound up just breaking it down into batches: 1) Create a new type of relationship like you suggest and then in a separate query count up all those relationships. I'm sure there are cleaner ways to do it. But since my storage exceeds my RAM and I'm not well versed on iterate, this got the job done. Your tip gave me a HUGE starting point! Thanks! – CowCookie Nov 27 '21 at 04:22
  • I'm really glad it helped. It sounds like you're doing some interesting work. If you ever want to join us remotely for Kansas City Graph Database Meetup, you'd be welcome. https://www.meetup.com/Kansas-City-Graph-Databases-Meetup-Group/events/282255011/ – Nathan Smith Nov 28 '21 at 01:55

This is not a complete solution, but for examples 1 and 3 only, you may consider computing a so-called one-mode projection outside the database, e.g. in Python.
I'm suggesting it because, in my experience, such operations can become really slow in Neo4j.

First, your graph data has to be stored as a tabular edge list (e.g. a pandas.DataFrame):

StartNode  EndNode  EdgeType
E1         P1       PUBLISHED
E7         P1       AUTHORED
...        ...      ...

However, the StartNode and EndNode columns have to represent separate populations of nodes (here, entities in one column and posts in the other), so your data would need some further transformation.
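One possible form of that transformation, sketched in plain Python under stated assumptions: the rows, the entity_ids set, and the helper orient are all hypothetical, and the sketch assumes you can tell entity ids apart from post ids. It flips rows so that the entity always comes first and skips Post-to-Post rows, which is why this approach only covers examples 1 and 3.

```python
# Hypothetical edge rows (start, end, type) and a hypothetical set of entity ids
rows = [
    ("E1", "P1", "PUBLISHED"),
    ("P1", "E7", "MENTIONS"),
    ("E2", "P2", "TWEETED"),
    ("P2", "P3", "QUOTE_TWEETED"),
]
entity_ids = {"E1", "E2", "E7"}


def orient(rows, entity_ids):
    """Return rows with the entity as StartNode and the post as EndNode."""
    oriented = []
    for start, end, edge_type in rows:
        start_is_entity = start in entity_ids
        end_is_entity = end in entity_ids
        if start_is_entity and not end_is_entity:
            oriented.append((start, end, edge_type))
        elif end_is_entity and not start_is_entity:
            # Post -> Entity row: swap so the entity comes first
            oriented.append((end, start, edge_type))
        # Rows joining two nodes of the same kind (e.g. Post -> Post) are skipped
    return oriented


print(orient(rows, entity_ids))
```

The resulting tuples can be loaded into a DataFrame with the StartNode, EndNode, and EdgeType columns described above.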

Then the monopartite projection can be computed as the product of the bipartite adjacency matrix and its transpose:

import numpy as np
import pandas as pd

# graph_df is assumed to be the edge table described above
adj_df = graph_df[["StartNode", "EndNode"]].copy()

# buffer column which will be used for aggregation purposes to indicate adjacency
adj_df["Adjacent"] = 1

# Pivot the dataframe so each cell holds the number of edges between a start node (row) and an end node (column)
adj_df = pd.pivot_table(
    adj_df, values="Adjacent", index=["StartNode"], columns=["EndNode"], aggfunc="sum"
)

# NAs have to be temporarily converted to 0 in order to allow multiplication
adj_df.fillna(0, inplace=True)

# Keep the row labels (needed later), then convert to a numpy matrix for ease of calculation
nodes = adj_df.index.to_list()
adj_matrix = adj_df.values
del adj_df

# Compute the product of the adjacency matrix and its transpose
adj_matrix = adj_matrix.dot(adj_matrix.T)

# self-links are not allowed, therefore the diagonal is filled with zeros
np.fill_diagonal(adj_matrix, 0)

# only one direction per pair is relevant, therefore
# only the upper triangle (the lower would work equally well) is kept
adj_matrix *= 1 - np.tri(*adj_matrix.shape, k=-1)

# convert 0s back to NaN so that empty pairs can be dropped
adj_matrix[adj_matrix == 0] = np.nan

# stack transforms the DataFrame to long form with a multi-index;
# dropna removes the empty pairs regardless of pandas version
monopartite = pd.DataFrame(adj_matrix, index=nodes, columns=nodes).stack().dropna()
monopartite.index.names = ["StartNode", "EndNode"]
del adj_matrix

# resetting the index to get the clean form of graph representation which is ready to import to Neo4j
monopartite = monopartite.reset_index()
monopartite.rename(columns={0: "strength"}, inplace=True)
monopartite["strength"] = monopartite["strength"].astype(int)
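As a sanity check on the matrix identity used above, here is a tiny example in plain Python that is small enough to verify by hand. The entity and post names, the links set, and the helper times_transpose are hypothetical. Each off-diagonal entry of the product is the number of posts the two entities share, and each diagonal entry is the entity's own number of posts.

```python
# Hypothetical bipartite data: which entity is linked to which post
entities = ["E1", "E2", "E3"]
posts = ["P1", "P2"]
links = {("E1", "P1"), ("E2", "P1"), ("E1", "P2"), ("E3", "P2")}

# Bipartite adjacency matrix A: one row per entity, one column per post
A = [[1 if (e, p) in links else 0 for p in posts] for e in entities]


def times_transpose(m):
    """Return m multiplied by its transpose, as nested lists."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in m] for row_i in m]


co = times_transpose(A)

# Keep only the upper triangle and non-zero entries, as the pandas script does
n = len(entities)
strength = {
    (entities[i], entities[j]): co[i][j]
    for i in range(n)
    for j in range(i + 1, n)
    if co[i][j]
}

print(co)
print(strength)
```

Here E1 and E2 share P1, E1 and E3 share P2, and E2 and E3 share nothing, so only two weighted edges remain.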

I can't give the execution time of the equivalent operation in Neo4j: I ran the query @Nathan suggested above, with [*2] in the first line, on a graph of around 140k edges and gave up waiting after 1 hour. The Python script ran in about 5 seconds.

Sorry if the script is not optimal / the answer is messy or not strictly related to your question. These are my early days on the forum :).

c4nd13