I am trying to parallelize the metadata ingestion job in my project, where I am using Amundsen, but I am facing issues. Below is the code snippet for it. I am parallelizing at the account level in Snowflake; the metadata fetched from Snowflake is then ingested into Neo4j.
import multiprocessing
import time

def process_all_snowflake_accounts():
    """Loop through all the Snowflake accounts, one process per account."""
    snowflake_config = read_snowflake_configuration()
    start_time = time.time()
    processes = []
    for ac_key, ac_config in snowflake_config.items():
        process = multiprocessing.Process(
            target=multiprocessing_snowflake_accounts,
            args=(ac_key, ac_config),
        )
        processes.append(process)
        process.start()
    for process in processes:
        process.join()
    # cpu_count is a function, so it must be called
    print("CPU count: ", multiprocessing.cpu_count())
    print('****************************************************************')
    print('Total time taken: ', time.time() - start_time)
    print('****************************************************************')
Sometimes the above code silently skips some of an account's databases without raising any error, but most of the time it fails with the error below:
"Scanning Snowflake ..."
"Process account: Account-1"
"Process account: Account-2"
"Process account: Account-3"
"Launching job for Account-1-DB-1"
"Launching job for Account-2-DB-1"
"Launching job for Account-3-DB-1"
"Launching job for Account-2-DB-2"
"Launching job for Account-1-DB-2"
ERROR:databuilder.publisher.neo4j_csv_publisher:Failed to publish. Rolling back.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/amundsen_databuilder-2.6.4-py3.7.egg/databuilder/publisher/neo4j_csv_publisher.py", line 202, in publish_impl
tx = self._publish_node(node_file, tx=tx)
File "/usr/local/lib/python3.7/site-packages/amundsen_databuilder-2.6.4-py3.7.egg/databuilder/publisher/neo4j_csv_publisher.py", line 266, in _publish_node
with open(node_file, 'r', encoding='utf8') as node_csv:
FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/amundsen/tables/nodes/Description_4.csv'