The popular graph database Neo4j can be used within R thanks to the package/driver RNeo4j (https://github.com/nicolewhite/Rneo4j).

The package author, @NicoleWhite, provides several great examples of its usage on GitHub.

Unfortunately for me, the examples given by @NicoleWhite and the documentation are too simple for my use case, in that they manually create each graph node and its associated labels and properties, such as:

mugshots = createNode(graph, "Bar", name = "Mugshots", location = "Downtown")
parlor = createNode(graph, "Bar", name = "The Parlor", location = "Hyde Park")
nicole = createNode(graph, name = "Nicole", status = "Student")
addLabel(nicole, "Person")

That's fine when you're dealing with a tiny example dataset, but this approach isn't feasible for something like a large social graph with thousands of users, where each user is a node. Such graphs might not use every node in every query, but every node still needs to be loaded into Neo4j.

I'm trying to figure out how to do this using vectors or data frames. Is there a solution, perhaps involving an apply statement or a for loop?

This basic attempt:

for (i in 1:length(df$user_id)){
paste(df$user_id[i]) = createNode(graph, "user", name = df$name[i], email = df$email[i])
}

Leads to Error: 400 Bad Request

Hack-R

1 Answer

As a first attempt, you should look at the functionality I just added for the transactional endpoint:

http://nicolewhite.github.io/RNeo4j/docs/transactions.html

library(RNeo4j)

graph = startGraph("http://localhost:7474/db/data/")
clear(graph)

data = data.frame(Origin = c("SFO", "AUS", "MCI"),
                  FlightNum = c(1, 2, 3),
                  Destination = c("PDX", "MCI", "LGA"))


query = "
MERGE (origin:Airport {name:{origin_name}})
MERGE (destination:Airport {name:{dest_name}})
CREATE (origin)<-[:ORIGIN]-(:Flight {number:{flight_num}})-[:DESTINATION]->(destination)
"

t = newTransaction(graph)

for (i in 1:nrow(data)) {
  origin_name = data[i, ]$Origin
  dest_name = data[i, ]$Destination
  flight_num = data[i, ]$FlightNum

  appendCypher(t, 
               query, 
               origin_name = origin_name, 
               dest_name = dest_name, 
               flight_num = flight_num)
}

commit(t)

cypher(graph, "MATCH (o:Airport)<-[:ORIGIN]-(f:Flight)-[:DESTINATION]->(d:Airport)
               RETURN o.name, f.number, d.name")

Here, I form a Cypher query and then loop through a data frame and pass the values as parameters to the Cypher query. Your attempts right now will be slow, because you're sending a separate HTTP request for each node created. By using the transactional endpoint, you create several things under a single transaction. If your data frame is very large, I would split it up into roughly 1000 rows per transaction.
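The suggestion to split a large data frame into roughly 1000 rows per transaction could be sketched as follows. This is only a minimal sketch: `batch_indices` and `load_in_batches` are hypothetical helper names, not part of RNeo4j, and the column names are assumed to match the flights example above. The only RNeo4j calls used are `newTransaction`, `appendCypher` and `commit`, as in the loop above.

```r
# Hypothetical helper: split row indices 1..n into consecutive
# batches of at most batch_size rows each.
batch_indices = function(n, batch_size = 1000) {
  if (n == 0) return(list())
  idx = seq_len(n)
  unname(split(idx, ceiling(idx / batch_size)))
}

# Hypothetical helper: open one transaction per batch, append one
# parameterized Cypher statement per row, then commit the batch.
# Assumes `data` has Origin, Destination and FlightNum columns.
load_in_batches = function(graph, data, query, batch_size = 1000) {
  for (rows in batch_indices(nrow(data), batch_size)) {
    t = RNeo4j::newTransaction(graph)
    for (i in rows) {
      RNeo4j::appendCypher(t,
                           query,
                           # as.character() guards against factor columns
                           origin_name = as.character(data$Origin[i]),
                           dest_name = as.character(data$Destination[i]),
                           flight_num = data$FlightNum[i])
    }
    RNeo4j::commit(t)
  }
  invisible(NULL)
}

# Usage, assuming `graph`, `data` and `query` are defined as above:
# load_in_batches(graph, data, query, batch_size = 1000)
```

Keeping the index splitting separate from the database calls makes it easy to check the batch sizes without a running Neo4j instance. The batch size of 1000 is just the rough figure mentioned above and may need tuning.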

As a second attempt, you should consider using LOAD CSV in the neo4j-shell.

Nicole White
  • This is an extremely useful solution to my problem, thank you! – Hack-R Aug 14 '14 at 14:20
  • The reason I would've preferred to use LOAD CSV in RNeo4j preceded by write.csv() instead of in your Neo4j shell is that doing it your way would require me to leave RStudio. Even for something simple like that, having to leave my IDE is a non-trivial interruption to my workflow and makes automation harder. That's why your RNeo4j package is awesome. – Hack-R Aug 15 '14 at 11:23
  • Is this still best practice @nicole-white? – geotheory Oct 25 '17 at 13:49
  • It would be nice if there was a block-add version to speed this up. A few hundred rows per second is pretty darn slow. – EngrStudent Dec 04 '17 at 11:19