
I'm currently having an issue trying to create a UDF to access a graph DB for each record in my DataFrame. Example testdf:

    id  name
    1   tom
    2   ben
    ..  etc.

I have written a function which takes an id and looks into the Neptune graph to see if the specific id is connected to another vertex. It looks something like this:

    def getEngineer(id):
        return g.V(f"{id}").repeat(__.out('knows').simplePath()).until(__.hasLabel('engineer')).dedup().elementMap('id').toList()

    getEngineerUDF = udf(lambda z: getEngineer(z))

I have wrapped this function into a UDF and am trying to use it with withColumn:

    finDf = testdf.withColumn('EngInNeptune', getEngineerUDF(F.col('id')))

When I run the above command I receive this error:

    Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
        return cloudpickle.dumps(obj, pickle_protocol)
      File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
        cp.dump(obj)
      File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
        return Pickler.dump(self, obj)
    TypeError: cannot pickle '_queue.SimpleQueue' object

I will appreciate any help. (I'm still pretty new, so sorry if I've missed something.)

Is it possible to implement something like this? I'm under the assumption that Gremlin doesn't like being put into a UDF because of how Spark handles them (concurrently?).

ak97
  • Does the discussion here help? https://stackoverflow.com/questions/44144584/typeerror-cant-pickle-thread-lock-objects – Kelvin Lawrence Jan 18 '22 at 20:10
  • Put getEngineer in a separate python module and have that module (which is run at the spark executor nodes) instantiate the g GraphTraversalSource as a singleton object. With only part of your code visible, it is hard to say what the offending part is, but this way you have the least chance of sending offending code to the executors. – HadoopMarc Jan 21 '22 at 07:24
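For reference, here is a minimal sketch of what HadoopMarc's comment suggests, assuming gremlinpython is available on the executors: keep the GraphTraversalSource in a separate module as a lazily created singleton, so the UDF no longer closes over the live connection (the driver's transport objects are likely the source of the unpicklable `_queue.SimpleQueue` in the traceback). The `neptune_client` module name and `NEPTUNE_ENDPOINT` value below are hypothetical.

    # neptune_client.py -- ship this module to the executors (e.g. via --py-files)
    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    NEPTUNE_ENDPOINT = 'wss://your-neptune-endpoint:8182/gremlin'  # hypothetical

    _g = None  # module-level singleton, created lazily on each executor


    def _get_g():
        global _g
        if _g is None:
            _g = traversal().withRemote(DriverRemoteConnection(NEPTUNE_ENDPOINT, 'g'))
        return _g


    def getEngineer(id):
        g = _get_g()
        # udf() defaults to StringType, so stringify the result here
        # (or declare an explicit return type when registering the UDF)
        return str(g.V(f"{id}").repeat(__.out('knows').simplePath())
                    .until(__.hasLabel('engineer')).dedup().elementMap('id').toList())

On the driver side, only the plain module-level function gets pickled (by reference), not a connection object, so each executor builds its own connection on first use:

    # driver code
    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    import neptune_client

    getEngineerUDF = udf(neptune_client.getEngineer)
    finDf = testdf.withColumn('EngInNeptune', getEngineerUDF(F.col('id')))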

0 Answers