I would like to rank items according to a given user's preferences (the items the user has liked), based on a random walk on a directed bipartite graph, using Gremlin in Groovy.

The graph has the following basic structure:

[User1] ---'likes'---> [ItemA] <---'likes'--- [User2] ---'likes'---> [ItemB]

Here is the query that I came up with:

def runRankQuery(def userVertex) {
    def m = [:]
    def c = 0
    while (c < 1000) {
        userVertex
            .out('likes')   // get all liked items of current or similar user
            .shuffle[0]     // select randomly one liked item
            .groupCount(m)  // update counts for selected item
            .in('likes')    // get all users who also liked item
            .shuffle[0]     // select randomly one user that liked item
            .loop(5){Math.random() < 0.5}   // follow liked edge of new user (feed new user in loop) 
                                            // OR abort query (restart from original user, outer loop)      
            .iterate()
        c++
    }
    m = m.sort {a, b -> b.value <=> a.value}
    println "intermediate result $m"
    m.keySet().removeAll(userVertex.out('likes').toList())
    // EDIT (makes no sense - remove): m.each{k,v -> m[k] = v / m.values().sum()}
    // EDIT (makes no sense - remove): m.sort {-it.value }
    return m.keySet() as List;
}

However, this code does not find new items ([ItemB] in the example above); it only finds the items already liked by the given user (e.g. [ItemA]).

  • What do I need to change so that the loop step feeds a new user (e.g. [User2]) back into the 'out('likes')' step and the walk continues?

  • Once this code is working, can it be seen as an implementation of 'Personalized PageRank'?


Here is the code to run the example:

g = new TinkerGraph()

user1 = g.addVertex()
user1.name ='User1'
user2 = g.addVertex()
user2.name ='User2'
itemA = g.addVertex()
itemA.name ='ItemA'
itemB = g.addVertex()
itemB.name ='ItemB'

g.addEdge(user1, itemA, 'likes')
g.addEdge(user2, itemA, 'likes')
g.addEdge(user2, itemB, 'likes')

println runRankQuery(user1)

And the output:

intermediate result [v[2]:1000]
[]
==>null
gremlin> g.v(2).name
==>ItemA
gremlin> 
Faber

1 Answer

This turned out to be a strange issue. I ran into several problems that I can't easily explain, and in the end I'm not sure what causes them. The two that stand out are:

  1. I'm not sure if there is a problem with the shuffle step. It does not seem to randomize properly in your case here. I can't seem to recreate the problem outside of this case, so I'm not sure if it's somehow related to the size of your data or something else.
  2. I hit strange problems when using Math.random() to break out of the loop.

Anyway, I think I've captured the essence of your code here with my changes that seem to do what you want:

runRankQuery = { userVertex ->
    def m = [:]
    def c = 0
    def rand = new java.util.Random()
    while (c < 1000) {
        def max = rand.nextInt(10) + 1
        userVertex._().as('x')
            .out('likes')   
            .gather.transform{it[rand.nextInt(it.size())]}
            .groupCount(m) 
            .in('likes')    
            .gather.transform{it[rand.nextInt(it.size())]}
            .loop('x'){it.loops < max}  
            .iterate()
        c++
    }
    println "intermediate result $m"
    m.keySet().removeAll(userVertex.out('likes').toList())
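    def total = m.values().sum()      // sum once, before the values are overwritten
    m.each{k,v -> m[k] = v / total}   // normalize counts to relative frequencies
    m = m.sort {-it.value }           // sort returns a new map, so keep the result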
    return m.keySet() as List;
}

I replaced shuffle with my own brand of "shuffle" by randomly selecting a single vertex from the gathered list. I also pick a random maximum number of loops for each walk rather than relying on Math.random(). When I run this now, I think I get the results you are looking for:

gremlin> runRankQuery(user1)                                       
intermediate result [v[2]:1787, v[3]:326]
==>v[3]
gremlin> runRankQuery(user1)
intermediate result [v[2]:1848, v[3]:330]
==>v[3]
gremlin> runRankQuery(user1)
intermediate result [v[2]:1899, v[3]:339]
==>v[3]
gremlin> runRankQuery(user1)
intermediate result [v[2]:1852, v[3]:360]
==>v[3]

You might still get Math.random() to work; it did behave predictably for me on some iterations while I was working on this.
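To make the difference between a fixed walk length and a per-step restart concrete, here is a minimal plain-Groovy sketch of a random walk with restart on the example graph. It is an illustration under stated assumptions and not the Gremlin pipeline above: it does not use TinkerPop at all, the adjacency maps `likes` and `likedBy` and the names `rankItems` and `restartProbability` are hypothetical, and it assumes every user has at least one liked item.

```groovy
// Plain-Groovy sketch (no TinkerPop). All names here are illustrative assumptions.
def likes   = [User1: ['ItemA'], User2: ['ItemA', 'ItemB']]   // user -> liked items
def likedBy = [ItemA: ['User1', 'User2'], ItemB: ['User2']]   // item -> users who like it

def rankItems = { String startUser, int walks, double restartProbability, Random rand ->
    def counts = [:].withDefault { 0 }
    walks.times {
        def user = startUser
        while (true) {
            def items = likes[user]
            def item = items[rand.nextInt(items.size())]   // pick one liked item uniformly
            counts[item] += 1                               // count the visit
            def users = likedBy[item]
            user = users[rand.nextInt(users.size())]        // pick one user who liked it
            // End this walk with a fixed probability; the next walk restarts at startUser.
            if (rand.nextDouble() < restartProbability) break
        }
    }
    // Drop items the start user already likes, then order by visit count (descending).
    def unseen = counts.findAll { k, v -> !(k in likes[startUser]) }
    unseen.sort { -it.value }.keySet() as List
}

def ranking = rankItems('User1', 1000, 0.5d, new Random(42))
println ranking
```

On the example graph only ItemB remains after the items already liked by User1 are filtered out. Because each walk ends with the same probability after every step, the walk length is geometrically distributed, which is the usual sense of "restart" or "teleportation"; drawing `max` up front gives a uniformly distributed length instead.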

stephen mallette
  • Many thanks for your solution, stephen. It kind of does what I want it to do. However, your changes leave me with quite a few other questions: 1) Why do we need `._()`? 2) So is the problem with `shuffle` a bug that needs to be tracked somewhere? 3) Why is no `scatter` required after `gather`? – Faber Jul 17 '14 at 13:41
  • 4) If I change the `loop` command to `.loop('x'){println "gremlinLoopCount: ${it.loops} / $max"; it.loops < max}` I see that the loop count always starts with 3 (!) and increases afterwards by +2. I even see outputs like `gremlinLoopCount: 3 / 0`. How is that possible? 5) So if we change from a variable walk length (`Math.random() < 0.5`) to a fixed number of steps (`it.loops < max`), can the algorithm still be considered as a random walk with teleportation/restart? I don't think so. – Faber Jul 17 '14 at 13:53
  • I just used the `_()` to mark the step to loop back to with an `x`. I didn't think you could use `as` right off a vertex. I can't repro the `shuffle` problem, so I'm not truly sure it is a bug. If you have repro steps with a simpler case, you can create an issue in Pipes. I randomly select one item from the pipe, so I've unrolled the `List` in the `transform`. `loop` is breadth first, so you can't expect the `println it.loops` to print anything in order. It may look like it's incrementing in twos, but it isn't. – stephen mallette Jul 17 '14 at 15:05
  • Also note that the count technically starts at "2"; I guess you might think of it as "off by one". This is fixed in TinkerPop3. Technically that means we should see output that always starts at 2, which I have to admit I'm not seeing at the moment. I need to research this some more. – stephen mallette Jul 17 '14 at 15:06
  • I think "0" is a valid value for `max`, as the rand should give you a value between 0 and 9 (10 is exclusive). Editing my answer to add 1. Not sure if using this approach breaks the classic concept of a random walk. That's up to you :) You might try to go back to using the `rand` var again to break out of the loop. I just know that I seemed to have problems with it, which I didn't explore further. Perhaps my problems were related to `shuffle`. – stephen mallette Jul 17 '14 at 15:10