23

I would like to retrieve a specific number of random nodes. The graph consists of 3,000,000 nodes, some of which are sources, some targets, and some both.

The aim is to retrieve random sources. As I don't know how to select at random in Cypher, the program generates k random numbers from 1 to 3,000,000, which represent node IDs, and then discards all randomly selected nodes that are not sources. As this procedure is time-consuming, I wonder whether it is possible to select random sources directly with a Cypher query.

To select all sources, the query would be the following:

START t=node(*) MATCH (a)-[:LEADS_TO]->(t) RETURN a

Does anyone know how to select a limited number of random nodes directly with Cypher or, if that is not possible, can anyone suggest a workaround?

mx0
Niko Gamulin

4 Answers

29

You can use a construction like this:

MATCH (a)-[:LEADS_TO]->(t) 
RETURN a, rand() as r
ORDER BY r

It should return a randomly ordered set of objects.

Tested with Neo4j 2.1.3
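
Since the question asks for a fixed number of random sources, a minimal sketch of the same idea with a cap might look like the following. The `DISTINCT` step and the limit of 10 are assumptions added for illustration: without de-duplication, a source with several outgoing `LEADS_TO` relationships appears once per relationship and is therefore more likely to be picked.

```cypher
// Collapse the matches to one row per source node first,
// so a node with many outgoing relationships is not over-represented.
MATCH (a)-[:LEADS_TO]->()
WITH DISTINCT a
// Attach a random sort key to each source and keep the first 10 (illustrative value).
WITH a, rand() AS r
ORDER BY r
LIMIT 10
RETURN a
```

The query still has to enumerate every source before sorting, so on a graph of this size its cost is worth measuring.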

Lukasz Stelmach
  • 4
    Nice, though for 3,000,000 nodes it might be slow as I think neo4j would load all of the nodes into memory to do the sort – Brian Underwood Dec 24 '14 at 14:51
  • 6
    Brilliant! This is more compact: MATCH (a)... RETURN a ORDER BY rand() – zakmck Sep 01 '19 at 11:57
  • 1
    @BrianUnderwood, as of 2019 it seems that the optimiser somehow knows it doesn't need to pre-load all nodes, it takes a few ms on my laptop. I've seen similar syntaxes in SPARQL and SQL, working efficiently too. – zakmck Sep 01 '19 at 11:59
  • This should be the approved answer. Very neat thanks. – Doug Apr 07 '21 at 13:13
13

You can limit your query with SKIP/LIMIT, so you could do:

START t=node(*) 
MATCH (a)-[:LEADS_TO]->(t) 
RETURN a
SKIP {randomoffset} LIMIT {randomcount} 

Otherwise, you can also create a set of random node IDs and pass them as a parameter to the Cypher statement.
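
The second suggestion could be sketched as below, assuming a list parameter named `ids` (a hypothetical name) that holds the randomly generated node IDs, and a Neo4j version that accepts `MATCH` without `START`. The relationship pattern filters out IDs that do not belong to sources, so the result may contain fewer nodes than the number of IDs supplied.

```cypher
// {ids} is a hypothetical list parameter of random node IDs, e.g. [17, 4242, 1500321].
// Newer Neo4j versions write parameters as $ids instead of {ids}.
MATCH (a)-[:LEADS_TO]->()
WHERE id(a) IN {ids}
RETURN DISTINCT a
```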

Michael Hunger
  • Thanks Michael! I have already created a random set of nodes but not all of randomly generated ids correspond to source nodes - some are just end-nodes. I'll apply your suggestion. – Niko Gamulin Sep 20 '12 at 17:56
  • 2
    in this case the offset is random but would the set be contiguous in some way? i.e. if the randomcount was 100 would the 100 records be returned according to node ids or is it a random sorting with every call? – MonkeyBonkey Jan 12 '13 at 12:18
  • as far as I understand, it would. WDYT, Michael? – fbiville Apr 24 '14 at 15:34
  • @MonkeyBonkey I just did a small test-set and yes, the results would be contiguous. For example if `SKIP 10 LIMIT 3` gives you `[10, 11, 12]`, then `SKIP 10 LIMIT 2` will always give you `[10, 11]` – Eric Olson Jun 19 '14 at 18:26
-1

Another approach, in case you want random start nodes together with all their connections:

MATCH (a)-[:LEADS_TO]->()
WITH a, rand() AS r
ORDER BY r LIMIT {YourLimit}
MATCH (a)-[l:LEADS_TO]->(t)
RETURN a, l, t
Roee Gavirel
-1
MATCH (n:Label)
WITH n, rand() AS r
ORDER BY r
RETURN n LIMIT <no. of random nodes>
pj ramya