
I'm currently working on importing GCP data into Apache Atlas, and I defined typedefs with parent-child relationships as follows (a rough sketch of one of the relationship defs is included after the list):

  1. gcp_bigquery_dataset has an array of gcp_bigquery_table entities as children
  2. gcp_bigquery_table has a single gcp_bigquery_dataset as parent and an array of gcp_bigquery_column entities as children
  3. gcp_bigquery_column has a single gcp_bigquery_table as parent
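
For context, here is a minimal sketch of what the dataset-to-table relationship def looks like, posted to the V2 typedefs endpoint. The relationship name, the attribute names ("tables" / "dataset"), the host, and the credentials are placeholders, not necessarily what my actual typedefs use:

```python
import json

import requests

# Placeholders: adjust host, port and credentials for your Atlas instance.
ATLAS_V2 = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

# Illustrative relationship def tying gcp_bigquery_dataset to its tables.
# "gcp_bigquery_dataset_tables", "tables" and "dataset" are made-up names;
# substitute whatever the real typedefs use.
dataset_tables_rel = {
    "relationshipDefs": [{
        "name": "gcp_bigquery_dataset_tables",
        "relationshipCategory": "COMPOSITION",
        "propagateTags": "NONE",
        "endDef1": {"type": "gcp_bigquery_dataset", "name": "tables",
                    "isContainer": True, "cardinality": "SET"},
        "endDef2": {"type": "gcp_bigquery_table", "name": "dataset",
                    "isContainer": False, "cardinality": "SINGLE"},
    }]
}

resp = requests.post(f"{ATLAS_V2}/types/typedefs", auth=AUTH,
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(dataset_tables_rel))
resp.raise_for_status()
```

The table-to-column relationship is defined the same way.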

I explored how to import entities at the dataset level and created the necessary entity JSON, which I can POST to the V2 API of Apache Atlas.

I tried two ways:

Import API via Zip on Server:

The total zip size was 78 MB; it contained 57 individual entities and about 80,000 entities in total, counting relationships and referred entities.

The server processes 2 of the entities, stops at the 3rd, and does not respond back. After a while, ZooKeeper marks the session as expired; the import call never returns any status and just keeps waiting.
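
For reference, the import call is roughly the following (shown here with the multipart upload form of the Admin Import API; the host, credentials, options, and zip file name are placeholders for my environment):

```python
import json

import requests

# Placeholders for my environment: host, credentials and the zip file name.
ATLAS_BASE = "http://localhost:21000/api/atlas"
AUTH = ("admin", "admin")
import_options = {"options": {}}  # import options (e.g. transforms) would go here

with open("gcp_bigquery_export.zip", "rb") as zip_file:
    resp = requests.post(
        f"{ATLAS_BASE}/admin/import",
        auth=AUTH,
        files={
            "request": (None, json.dumps(import_options), "application/json"),
            "data": ("gcp_bigquery_export.zip", zip_file, "application/octet-stream"),
        },
        timeout=3600,  # the call blocks until the server finishes -- or, as above, never returns
    )
resp.raise_for_status()
print(resp.json())
```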

Using POST on the V2 REST API

The issue is that the JSON is created at the dataset level, which results in a payload of more than 20 MB, and the server takes too long to respond. I was able to successfully import about 45,000 entities, but those imports hit timeouts on the client side, and I still have about 60,000 entities to import.

The timeouts are causing issues because I'm trying to make synchronous calls to the Atlas server.
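
Roughly how each POST looks today (a sketch, not my exact client code; the helper name, host, credentials, and timeout value are placeholders):

```python
import json

import requests

# Placeholders: host, credentials and timeout are illustrative.
ATLAS_V2 = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

def import_one_dataset(dataset_entity, referred_entities_by_guid):
    """One request per dataset: the gcp_bigquery_dataset entity plus every table
    and column of that dataset carried along as referredEntities. For a large
    dataset this single payload is what grows past 20 MB."""
    payload = {
        "entities": [dataset_entity],
        "referredEntities": referred_entities_by_guid,  # {negative guid -> table/column entity}
    }
    resp = requests.post(
        f"{ATLAS_V2}/entity/bulk",
        auth=AUTH,
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=600,  # this is where the client-side timeouts hit
    )
    resp.raise_for_status()
    return resp.json()
```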

Is there any better way of performing bulk imports? Thanks in advance!

Tameem
  • On the POST API calls, I've tried both the entity and the entity bulk APIs. But in both cases I'm passing a single entity that contains the complete metadata of one dataset, i.e., one dataset, all tables under that dataset, and all columns under those tables, plus the relationship attributes (dataset --> table and table --> column) – Tameem Sep 08 '22 at 19:21
