
I am having trouble parallelizing code that reads some files and writes to neo4j.

  • I am using dask to parallelize the process_language_files function (3rd cell from the bottom).
  • I explain the code below, listing out the functions (first 3 cells).
  • The errors are printed at the end (last 2 cells).
  • I also list the environment and package versions at the end.

If I remove dask.delayed and run this code sequentially, it works perfectly well.

Thank you for your help. :)

==========================================================================

Some functions to work with neo4j.

from neo4j import GraphDatabase
from tqdm import tqdm

def get_driver(uri_scheme='bolt', host='localhost', port='7687', username='neo4j', password=''):
"""Get a neo4j driver."""
    connection_uri = "{uri_scheme}://{host}:{port}".format(uri_scheme=uri_scheme, host=host, port=port)
    auth = (username, password)
    driver = GraphDatabase.driver(connection_uri, auth=auth)
    return driver

def format_raw_res(raw_res):
"""Parse neo4j results"""
    res = []
    for r in raw_res:
        res.append(r)
    return res

def run_bulk_query(query_list, driver):
"""Run a list of neo4j queries in a session."""
    results = []
    with driver.session() as session:
        for query in tqdm(query_list):
            raw_res = session.run(query)
            res = format_raw_res(raw_res)
            results.append({'query':query, 'result':res})
    return results

global_driver = get_driver(uri_scheme='bolt', host='localhost', port='8687', username='neo4j', password='abc123')  # neo4j driver object.

This is how we create a dask client to parallelize.

from dask.distributed import Client
client = Client(threads_per_worker=4, n_workers=1)

The functions that the main code is calling.

import sys
import time
import json

import pandas as pd

import dask

def add_nodes(nodes_list, language_code):
"""Returns a list of strings. Each string is a cypher query to add a node to neo4j."""
    list_of_create_strings = []
    create_string_template = """CREATE (:LABEL {{node_id:{node_id}}})"""

    for index, node in nodes_list.iterrows():
        create_string = create_string_template.format(node_id=node['new_id'])
        list_of_create_strings.append(create_string)

    return list_of_create_strings
        
def add_relations(relations_list, language_code):
"""Returns a list of strings. Each string is a cypher query to add a relationship to neo4j."""
    list_of_create_strings = []
    create_string_template = """
        MATCH (a),(b) WHERE a.node_id = {source} AND b.node_id = {target} 
        MERGE (a)-[r:KNOWS {{ relationship_id:{edge_id} }}]-(b)"""

    for index, relations in relations_list.iterrows():
        create_string = create_string_template.format(
            source=relations['from'], target=relations['to'],
            edge_id=''+str(relations['from'])+'-'+str(relations['to']))
        list_of_create_strings.append(create_string)

    return list_of_create_strings


def add_data(language_code, edges, features, targets, driver):
"""Add nodes and relationships to neo4j"""
    add_nodes_cypher = add_nodes(targets, language_code)  # Returns a list of strings. Each string is a cypher query to add a node to neo4j.
    node_results = run_bulk_query(add_nodes_cypher, driver)  # Runs each string in the above list in a neo4j session.


    add_relations_cypher = add_relations(edges, language_code)  # Returns a list of strings. Each string is a cypher query to add a relationship to neo4j.
    relations_results = run_bulk_query(add_relations_cypher, driver)  # Runs each string in the above list in a neo4j session.    
    
    # Saving some metadata
    results = {
        "nodes": {"results": node_results, "length":len(add_nodes_cypher),}, 
        "relations": {"results": relations_results, "length":len(add_relations_cypher),}, 
    }
    return results


def load_data(language_code):
"""Load data from files"""
    # Saving file names to variables
    edges_filename = './edges.csv'
    features_filename = './features.json'
    target_filename = './target.csv'
    
    # Loading data from the file names
    edges = helper.read_csv(edges_filename)
    features = helper.read_json(features_filename)
    targets = helper.read_csv(target_filename)
    
    # Saving some metadata
    results = {
        "edges": {"length":len(edges),}, 
        "features": {"length":len(features),}, 
        "targets": {"length":len(targets),},
    }
    return edges, features, targets, results

The main code.

def process_language_files(language_code, driver):
    """Reads files, creates cypher queries to add nodes and relationships, and runs the queries in a neo4j session."""
    edges, features, targets, reading_results = load_data(language_code)  # Read files.

    writing_results = add_data(language_code, edges, features, targets, driver)  # Convert files to nodes and relationships and add them to neo4j in a neo4j session.
    
    return {"reading_results": reading_results, "writing_results": writing_results}  # Return some metadata


# Execution starts here
res=[]
for index, language_code in enumerate(['ENGLISH', 'FRENCH']):
    
    lazy_result = dask.delayed(process_language_files)(language_code, global_driver)
    res.append(lazy_result)

The contents of res. These are dask delayed objects.

print(*res)
Delayed('process_language_files-a73f4a9d-6ffa-4295-8803-7fe09849c068') Delayed('process_language_files-c88fbd4f-e8c1-40c0-b143-eda41a209862')

The errors. Even if I use dask.compute(), I get similar errors.

futures = dask.persist(*res)
AttributeError                            Traceback (most recent call last)
~/Code/miniconda3/envs/MVDS/lib/python3.6/site-packages/distributed/protocol/pickle.py in dumps(x, buffer_callback, protocol)
     48         buffers.clear()
---> 49         result = pickle.dumps(x, **dump_kwargs)
     50         if len(result) < 1000:

AttributeError: Can't pickle local object 'BoltPool.open.<locals>.opener'

==========================================================================

# Name Version Build Channel
dask 2020.12.0 pyhd8ed1ab_0 conda-forge
jupyterlab 3.0.3 pyhd8ed1ab_0 conda-forge
neo4j-python-driver 4.2.1 pyh7fcb38b_0 conda-forge
python 3.9.1 hdb3f193_2

1 Answer


You are getting this error because you are trying to share the driver object among your workers.

The driver object contains private state about the connection, data that does not make sense outside the process that created it (and is not serializable).

It is like opening a file in one process and sharing the file descriptor with another process: it won't work, because the file descriptor is meaningful only within the process that created it.
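
You can reproduce the failure without dask by pickling the driver directly (the URI and credentials below are placeholders):

import pickle
from neo4j import GraphDatabase

# Placeholder connection details; any bolt driver object behaves the same way.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

try:
    pickle.dumps(driver)  # dask does this when it ships the task to a worker
except Exception as exc:
    print(type(exc).__name__, exc)  # expect the same "Can't pickle local object" error shown above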

If you want your workers to access the database or any other network resource, you should give them the information they need to connect to the resource themselves.

In your case, do not pass global_driver as a parameter; pass the connection parameters instead, and let each worker call get_driver to obtain its own driver.
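
A minimal sketch of that change, reusing the functions from the question (the neo4j_config dict and the per-task driver.close() are illustrative, not part of the original code):

def process_language_files(language_code, neo4j_config):
    """Same as before, but builds its own driver from plain (picklable) connection parameters."""
    driver = get_driver(**neo4j_config)  # each task opens its own connection
    try:
        edges, features, targets, reading_results = load_data(language_code)
        writing_results = add_data(language_code, edges, features, targets, driver)
    finally:
        driver.close()
    return {"reading_results": reading_results, "writing_results": writing_results}

neo4j_config = {'uri_scheme': 'bolt', 'host': 'localhost', 'port': '8687',
                'username': 'neo4j', 'password': 'abc123'}

res = []
for language_code in ['ENGLISH', 'FRENCH']:
    res.append(dask.delayed(process_language_files)(language_code, neo4j_config))

results = dask.compute(*res)  # only plain strings cross the task boundary, so pickling succeeds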

FxIII