
I have many papers in Neo4j that cite each other.

The data looks like this:

{"title": "TitleWave", "year": 2010, "references": ["002", "003"], "id": "001"}
{"title": "Title002", "year": 2005, "references": ["003", "004"], "id": "002"}
{"title": "RealTitle", "year": 2000,  "references": ["004", "001"], "id": "003"}
{"title": "Title004", "year": 2014, "references": ["001", "002"], "id": "004"}

I have created the relationships by doing:

CALL apoc.load.json('file.txt') YIELD value AS q
MERGE (p:Paper {id: q.id}) ON CREATE SET
  p.title = q.title,
  p.year = q.year,
  p.refs = q.references
WITH p
// create one CITES edge per reference (the referenced Paper must already exist to MATCH)
UNWIND p.refs AS ref
MATCH (p2:Paper {id: ref})
MERGE (p)-[:CITES]->(p2);
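
To sanity-check the load, something like this should confirm the CITES relationships exist:

// count the citation edges created by the load above
MATCH (:Paper)-[r:CITES]->(:Paper)
RETURN count(r) AS citation_count;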

I would like to run the algo.pageRank.stream procedure to get PageRank scores for every paper, and then normalize them across a large data set. Can I do this efficiently in one query?

This runs the PageRank algorithm, but does not normalize the scores:

CALL algo.pageRank.stream(
'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH *,
node.title AS title,  
score AS page_rank,
log(score) AS impact
ORDER BY impact DESC
LIMIT 100
RETURN title, page_rank, impact;

Is there a good way to normalize all of these impact values within the query? For example, one way to normalize would be to divide by the max value.

However, when I try doing this:

CALL algo.pageRank.stream(
'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH *,
node.title AS title, 
score AS page_rank,
log(score) AS impact,
max(log(score)) as max_val,
impact / max_val as impact_norm
ORDER BY impact_norm DESC
LIMIT 100
RETURN title, page_rank, impact_norm;

I get an error:

Variable `impact` not defined (line 18, column 1 (offset: 539))
"impact / max_val as impact_norm"

Any suggestions would be greatly appreciated!

Tim Holdsworth
  • Could you expand on what you mean by "normalize"? Does the normalized set need to be linear `n/max` or can it be logarithmic/exponential `n/(n+100)`. Does it need to normalize to a value between 0 and 1? – Tezra Jul 24 '18 at 20:11
  • In your failed query, the new variables aren't available till after that statement, so you would need another WITH to do the divide. On that note, everything will normalize to 1 because MAX would be taking the max of 1 element for each row. (It's an aggregate, so it will only aggregate across sets where everything else is the same); so effectively the same as n/n for each row. – Tezra Jul 24 '18 at 20:18
  • Not sure what the best method is but to start I think normalized between 0 and 1. I'm realizing that having the log might also create some scores less than 0... – Tim Holdsworth Jul 24 '18 at 20:19
  • Understood. I want to find the max score out of all the scores, rather than just the scores for that row. Do you suggest I collect of all the scores and find the max within this collection then? – Tim Holdsworth Jul 24 '18 at 20:22
  • So will `n/(n+100)` work for your normalization? (it converts a linear growth into a percent. so 100=50%, 200=66.66%, 900=90%, 9000=98.9%. You can change the constant 100 to whatever you want.) Normalizing the values against themselves would be much cheaper than folding everything to get the max value, and then unfolding everything. – Tezra Jul 24 '18 at 20:33
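
Following up on the comments: a minimal sketch of the two-stage approach Tezra describes, reusing the same algo.pageRank.stream call as above. The first WITH computes the per-row values; a second WITH aggregates over all rows at once, using collect() to keep the rows and max() to get the global maximum, and then the rows are unwound again for the division. As noted above, log(score) can be negative, so dividing by the max does not guarantee a value between 0 and 1.

CALL algo.pageRank.stream(
  'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
  'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
  {graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH node.title AS title, score AS page_rank, log(score) AS impact
// aggregate across ALL rows: max() now sees every impact value,
// and collect() keeps the per-row data so it can be unwound again
WITH collect({title: title, page_rank: page_rank, impact: impact}) AS rows,
     max(impact) AS max_val
UNWIND rows AS row
RETURN row.title AS title,
       row.page_rank AS page_rank,
       row.impact / max_val AS impact_norm
ORDER BY impact_norm DESC
LIMIT 100;

The n/(n+100) style normalization from the last comment would avoid the collect/unwind round trip entirely, since each row then depends only on its own score.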

0 Answers