
I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?

My only experience with parallelization is on another project, where I submit a custom function to a dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:

Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
    return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
...
  File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
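
Roughly, the pattern that fails looks like this (a simplified sketch; the query, schema, and names are just for illustration):

from functools import partial
from dask.distributed import Client
from grakn.client import GraknClient

grakn_client = GraknClient(uri="localhost:48555")
session = grakn_client.session(keyspace="grakn")
dask_client = Client()  # local dask.distributed cluster

def insert_row(session, row):
    # The session (and the gRPC channel inside it) has to be pickled to reach
    # the dask worker, which is what triggers the TypeError above.
    tx = session.transaction().write()
    tx.query(f'insert $x isa person, has name "{row}";')
    tx.commit()

futures = dask_client.map(partial(insert_row, session), ["Alice", "Bob"])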

I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?

davideps

2 Answers


The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:

from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial

def write_query_batch(session, batch):
    tx = session.transaction().write()
    for query in batch:
        tx.query(query)
    tx.commit()

def multi_thread_write_query_batches(session, query_batches, num_threads=8):
    pool = ThreadPool(num_threads)
    pool.map(partial(write_query_batch, session), query_batches)
    pool.close()
    pool.join()

def generate_query_batches(my_data_entries_list, batch_size):
    batch = []
    for index, data_entry in enumerate(my_data_entries_list):
        batch.append(data_entry)
        if (index + 1) % batch_size == 0:
            yield batch
            batch = []
    if batch:
        yield batch


# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")

query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)

session.close()
client.close()

The above is a generic method. As a concrete example, you can use it (omitting part 2) to parallelise batches of insert statements from two files. Appending the following to the code above should work:

import time  # needed for the elapsed-time reporting below

files = [
    {
        "file_path": f"/path/to/your/file.gql",
    },
    {
        "file_path": f"/path/to/your/file2.gql",
    }
]

KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000

# Entry point where migration starts
def migrate_graql_files():
    start_time = time.time()

    for file in files:
        print('==================================================')
        print(f'Loading from {file["file_path"]}')
        print('==================================================')

        with open(file["file_path"], "r") as graql_file:  # Here we are assuming you have 1 Graql query per line!
            batches = generate_query_batches(graql_file.readlines(), BATCH_SIZE)

        with GraknClient(uri=URI) as client:  # Using `with` auto-closes the client
            with client.session(KEYSPACE) as session:  # Using `with` auto-closes the session
                multi_thread_write_query_batches(session, batches, num_threads=16)  # Pick `num_threads` according to your machine

        elapsed = time.time() - start_time
        print(f'Time elapsed {elapsed:.1f} seconds')

    elapsed = time.time() - start_time
    print(f'Time elapsed {elapsed:.1f} seconds')

if __name__ == "__main__":
    migrate_graql_files()

You should also be able to see how you can load from a CSV or any other file type in this way, taking the values you find in that file and substituting them into Graql query string templates. Take a look at the migration example in the docs for more on that.
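
For instance, a minimal sketch along those lines (the CSV columns and the person/name/age schema are hypothetical), reusing the helpers defined above:

import csv

def csv_rows_to_queries(csv_path):
    # Substitute each row's values into a Graql insert template
    queries = []
    with open(csv_path) as csv_file:
        for row in csv.DictReader(csv_file):
            queries.append(
                f'insert $p isa person, has name "{row["name"]}", has age {row["age"]};'
            )
    return queries

queries = csv_rows_to_queries("/path/to/people.csv")
batches = generate_query_batches(queries, BATCH_SIZE)
# then, inside the client/session context:
# multi_thread_write_query_batches(session, batches, num_threads=16)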

James Fletcher
  • Thanks James. I read in the docs that "dummy" mimics multiprocessing's API but is just a wrapper over threading. That explains why I didn't see a speed-up in my code. There is still just one process running on one core. When I remove "dummy" in my code, I get the same familiar error: grpc._cython.cygrpc.Channel.__reduce_cython__ TypeError: no default __reduce__ due to non-trivial __cinit__. I haven't yet checked if this pops up in your code without "dummy". – davideps Jan 21 '20 at 09:04
  • I believe that Ray actors (http://ray.readthedocs.io/en/latest/actors.html) do not use Python's multiprocessing module, so they may be another option. However, Ray is not yet available for Windows. – davideps Jan 21 '20 at 16:15
  • Typically I have found that multithreading is helpful enough, since each query handled by Grakn can be performed in a new process, and the parallelisation of Grakn's operations is important for high throughput and/or long running queries. – James Fletcher Jan 22 '20 at 13:09
  • That makes sense. Perhaps the Python side is more I/O bound than I realize. I'm still not sure why multi-threading didn't result in a speedup in my code. – davideps Jan 22 '20 at 19:30
  • Did you have any luck identifying the issue? – James Fletcher Feb 20 '20 at 14:43
  • Hi James. I've done work in other aspects of my application but have not come back to this issue yet. Since speed is generally a concern, I may use a different database when the simulation is running and then load into grakn later. – davideps Feb 20 '20 at 15:06

An alternative approach, using multi-processing instead of multi-threading, follows.

We found empirically that multi-threading doesn't yield particularly large performance gains compared to multi-processing, probably due to Python's GIL.

This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
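
For instance, such a file might contain one independent insert per line, along these lines (the person/name schema here is just an example):

insert $p isa person, has name "Alice";
insert $p isa person, has name "Bob";
insert $p isa person, has name "Carol";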

from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue


def batch_writer(database, kill_event, batch_queue):
    client = TypeDB.core_client("localhost:1729")
    session = client.session(database, SessionType.DATA)
    while True:
        try:
            batch = batch_queue.get(block=True, timeout=1)
        except queue.Empty:
            # Only stop once the kill event is set AND the queue has drained,
            # so no queued batches are dropped.
            if kill_event.is_set():
                break
            continue
        with session.transaction(TransactionType.WRITE) as tx:
            for query in batch:
                tx.query().insert(query)
            tx.commit()
    print("Received kill event and drained queue, exiting worker.")
    session.close()
    client.close()

def start_writers(database, kill_event, batch_queue, parallelism=4):
    processes = []
    for _ in range(parallelism):
        proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
        processes.append(proc)
        proc.start()
    return processes

def batch(iterable, n=1000):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]


if __name__ == '__main__':
    batch_size = 100
    parallelism = 1
    database = "<database name>"

    file_path = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"

    with open(file_path, "r") as file:
        statements = file.read().splitlines()

    batch_statements = batch(statements, n=batch_size)
    total_batches = len(statements) // batch_size
    if len(statements) % batch_size > 0:
        total_batches += 1

    batch_queue = mp.Queue(parallelism * 4)
    kill_event = mp.Event()
    writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)
    for i, statement_batch in enumerate(batch_statements):
        batch_queue.put(statement_batch, block=True)
        if i * batch_size % 10000 == 0:
            print("Loaded: {0}/{1}".format(i * batch_size, total_batches * batch_size))
    kill_event.set()
    batch_queue.close()
    batch_queue.join_thread()
    for proc in writers:
        proc.join()
    print("Done loading")
flyingsilverfin