Below is an example from https://graphframes.github.io/graphframes/docs/_site/user-guide.html

The only thing that confuses me is the purpose of `lit(0)` in the definition of `condition`. Is `lit(0)` the value that gets fed into `cnt`? If so, why does it come after `["ab", "bc", "cd"]`?

from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import IntegerType
from graphframes.examples import Graphs
from functools import reduce

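# g is the example "friends" graph from the user guide
# (assumption: a sqlContext already exists in the session)
g = Graphs(sqlContext).friends()

chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")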

chain4.show()

sumFriends = lambda cnt,relationship: when(relationship == "friend", cnt+1).otherwise(cnt)

condition = reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))

chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()
mck
gllow

1 Answer

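lit(0) is the initializer of the reduce call. reduce takes its arguments in the order (function, iterable, initializer), which is why lit(0) comes after ["ab", "bc", "cd"]. It sets cnt to 0 on the first call to sumFriends, so the count starts from zero.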

condition = reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))

# should be equivalent to

condition = sumFriends(lit(0), col("ab").relationship)
condition = sumFriends(condition, col("bc").relationship)
condition = sumFriends(condition, col("cd").relationship)
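
The same unrolling can be checked without Spark. This is only a minimal sketch: plain Python strings stand in for the relationship columns, and `sum_friends` and `relationships` are hypothetical names made up for illustration, not part of GraphFrames.

```python
from functools import reduce

# Plain-Python stand-in for sumFriends: add 1 when the edge is a "friend" edge.
def sum_friends(cnt, relationship):
    return cnt + 1 if relationship == "friend" else cnt

# Hypothetical relationship values for the edges ab, bc, cd of a single row.
relationships = ["friend", "follow", "friend"]

# reduce(function, iterable, initializer): the third argument seeds cnt with 0.
reduced = reduce(sum_friends, relationships, 0)

# The same computation unrolled by hand, one edge at a time.
unrolled = sum_friends(0, relationships[0])
unrolled = sum_friends(unrolled, relationships[1])
unrolled = sum_friends(unrolled, relationships[2])

print(reduced, unrolled)  # both are 2 for this row
```

In the Spark version nothing is counted at this point; reduce only builds a nested when(...).otherwise(...) column expression, which Spark then evaluates for each row.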
mck
  • Thanks for answering. One more question: how does the function know that cnt should be assigned the initializer's value? – gllow Mar 29 '21 at 13:15
  • @gllow that's how the `reduce` function was defined in Python. You can have a look at the code example in the linked docs, especially the lines `value = initializer` and then `value = function(value, element)`. – mck Mar 29 '21 at 13:17
  • The initializer is used as the first argument of the provided lambda function. – mck Mar 29 '21 at 13:18
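
The two lines quoted in the comments can be expanded into a rough sketch of how reduce behaves when an initializer is given. `my_reduce` and `record` are hypothetical names; this is a simplified stand-in, not the actual implementation in functools.

```python
def my_reduce(function, iterable, initializer):
    # Simplified sketch: assumes an initializer is always supplied.
    value = initializer
    for element in iterable:
        # The running value is always passed as the first argument.
        value = function(value, element)
    return value

calls = []

def record(cnt, e):
    # Log the arguments of each call, then bump the counter.
    calls.append((cnt, e))
    return cnt + 1

result = my_reduce(record, ["ab", "bc", "cd"], 0)
print(calls)   # [(0, 'ab'), (1, 'bc'), (2, 'cd')]
print(result)  # 3
```

The first recorded call receives 0 as cnt, which is the initializer; every later call receives the value returned by the previous one.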