
I am working on a Node.js based script to import a large CSV file into an IBM Cloudant database. My approach for this script is as follows:

  • Script1: Download ZIP files from the remote server using an FTP connection.
  • Script1: Iterate over the downloaded ZIP files and extract the CSV files from them.
  • Script1: Upload each CSV file to an AWS S3 bucket using file read and write streams.
  • Script2: Connect to the S3 bucket and iterate over each CSV file found at a specific bucket location.
  • Script2: Read each CSV file as a stream and pipe it to convert the CSV data into JSON format.
  • Script2: Connect to the IBM Cloudant database.
  • Script2: Take each JSON object one by one and check for an existing Cloudant document by making a GET request.
  • Script2: If a document is found, update it with the JSON object; otherwise create a new document in Cloudant with a separate API request.
  • Script2: Remove the CSV file from the S3 bucket once all of its data has been migrated to the Cloudant database successfully.

What works so far:

  • Downloading ZIP files from FTP, extracting the CSV files from the ZIPs, and uploading them to the S3 bucket.
  • Reading a CSV file as a stream and converting it to JSON objects using the npm package "csvtojson" works like a charm if we just print each converted JSON object on the terminal (see the sketch after this list).
  • Connecting to the IBM Cloudant database is successful.
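For reference, a minimal sketch of that streaming step, assuming the AWS SDK v2 and csvtojson; the bucket/key names and the streamCsvToJson helper are placeholders of mine, not the actual script:

const AWS = require('aws-sdk');
const csv = require('csvtojson');

const s3 = new AWS.S3();

async function streamCsvToJson(bucket, key, onRow) {
  // Read the S3 object as a stream instead of loading the whole ~300MB file into memory.
  const s3Stream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();

  // csvtojson emits one JSON object per CSV row; if the subscriber returns a
  // Promise, the stream is paused until that row has been handled (backpressure).
  await csv().fromStream(s3Stream).subscribe((row) => onRow(row));
}

// Example: streamCsvToJson('my-bucket', 'exports/data.csv', (row) => console.log(row));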

Problem being faced:

  • In a loop that checks Cloudant for an existing record for each JSON object, the script does not get beyond a few records before it fails with a memory/heap related error. It also runs into the Cloudant rate limit, which gets saturated very quickly because the lookup and import operations happen inside the loop (a rough sketch of the loop follows below).
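Roughly, the problematic per-row flow looks like the sketch below (db is a Cloudant database handle and rows are the JSON objects from csvtojson; the names are illustrative, not my exact code):

// One GET plus one write is issued for every row, and nothing awaits the
// callbacks, so ~1.3M requests are queued almost at once. That exhausts the
// Node.js heap and saturates the Cloudant rate limit.
for (const row of rows) {
  db.get(row._id, (err, existing) => {
    if (err && err.statusCode === 404) {
      db.insert(row, handleResult);                             // new document
    } else if (!err) {
      db.insert({ ...row, _rev: existing._rev }, handleResult); // update
    }
  });
}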

Exact error:

  • FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory.
  • Too Many Requests on IBM Cloudant.

Notes:

  • It is possible there will be multiple ZIP files on the FTP server.
  • It is possible there will be multiple CSV files within the ZIP files.
  • One of the CSV files that I have been trying to import comprises about 1.3 million records, and the file size is about 300MB.
  • Increasing the rate limit on IBM Cloudant is not an option for this requirement.

Can someone please suggest the best way to tackle this requirement without running into the issues mentioned above? Any alternative npm package for this requirement is most welcome, as is any suggestion to change my development approach.

Mayur Bhatt
  • Have you tried bulk operations like bulk insert and _all_docs() with keys? Instead of "Script2: Take each JSON object one by one and check for an existing Cloudant document by making a GET request." you could make a bulk get. You could pass an array of document IDs or work with _all_docs() and batches. Are your dataset IDs in the CSV sorted? – Joshua Beckers Apr 22 '20 at 19:01
  • @JayBee, thanks for your response. I don't know much about IBM Cloudant, so I will have to check "_all_docs()". Is there any example of performing a bulk get? I can prepare an array of document IDs, but I am not sure how to use that with _all_docs() and batches. I see the CSV rows are sorted, with IDs starting with A and ending with Z. – Mayur Bhatt Apr 23 '20 at 03:44

1 Answer


Assuming that your CSV rows are sorted by ID, you can first do a bulk get:

Request:

POST /db/_all_docs HTTP/1.1
Accept: application/json
Content-Length: 70
Content-Type: application/json
Host: localhost:5984

{
    "keys" : [
        "a_id123",
        "b_id123",
        "c_id123",
        "d_id123",
    ]
}

You will then get something like the following response:

{
    "total_rows" : 2666,
    "rows" : [
        {
            "value" : {
                "rev" : "1-a3544d296de19e6f5b932ea77d886942"
            },
            "id" : "a_id123",
            "key" : "a_id123"
        },
        {
            "value" : {
                "rev" : "1-91635098bfe7d40197a1b98d7ee085fc"
            },
            "id" : "b_id123",
            "key" : "b_id123"
        }
    ],
    "offset" : 0
}

Now you can loop through the response and see which documents exist and which do not.
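In Node.js that check could look roughly like the sketch below, where allDocsResponse is the _all_docs result above and batch holds the corresponding CSV rows (each already carrying its _id); both names are placeholders:

// Index the returned rows by id. Rows for keys that do not exist come back
// with an "error": "not_found" field and no value.
const revById = new Map();
for (const row of allDocsResponse.rows) {
  if (row.value && !row.error) {
    revById.set(row.id, row.value.rev);
  }
}

// Attach the current _rev to documents that already exist so _bulk_docs
// updates them instead of reporting conflicts; the rest are created as new.
const docs = batch.map((doc) =>
  revById.has(doc._id) ? { ...doc, _rev: revById.get(doc._id) } : doc
);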

After this, you can bulk create the documents that don't exist and bulk update the ones that do (by including their _rev):

Request:

POST /db/_bulk_docs HTTP/1.1
Accept: application/json
Content-Length: 109
Content-Type:application/json
Host: localhost:5984

{
    "docs": [
        {
            "_id": "c_id123",
            "foo": "bar",
        },
        {
            "_id": "d_id123",
            "foo": "bar c",
        },
        {
            "_id": "a_id123",
            "_rev": "1-a3544d296de19e6f5b932ea77d886942",
            "foo": "updated",
        }
    ]
}

Documentation for bulk get: https://docs.couchdb.org/en/stable/api/database/bulk-api.html#post--db-_all_docs

Documentation for bulk create/update: https://docs.couchdb.org/en/stable/api/database/bulk-api.html#db-bulk-docs
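If you drive this from Node.js with the nano-style Cloudant client, the same two endpoints are exposed as db.fetchRevs / db.fetch (POST /db/_all_docs with keys) and db.bulk (POST /db/_bulk_docs). A minimal sketch, assuming that client and a batch of CSV rows that already carry their _id:

// Sketch only: `db` is a database handle from the nano-style client and
// `batch` is one chunk of CSV rows converted to JSON.
async function upsertBatch(db, batch) {
  // 1. Bulk existence check for this batch.
  const result = await db.fetchRevs({ keys: batch.map((doc) => doc._id) });

  // 2. Remember the current _rev of every document that already exists.
  const revById = new Map();
  for (const row of result.rows) {
    if (row.value && !row.error) revById.set(row.id, row.value.rev);
  }

  // 3. One bulk write: new docs as-is, existing docs with _rev so they update.
  const docs = batch.map((doc) =>
    revById.has(doc._id) ? { ...doc, _rev: revById.get(doc._id) } : doc
  );
  return db.bulk({ docs });
}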

Joshua Beckers
  • Thanks once again. I have tried the approach you suggested in your comment, but unfortunately the script is not able to give me a proper response with data when I call `myDb.bulk_get({ "keys": uniqueKeys }, (err, foundData) => { if(err) { console.log(err); } else { console.log(foundData); } });` It ends up with the error `description: 'couch returned 413'` – Mayur Bhatt Apr 23 '20 at 06:26
  • @MayurBhatt Error 413 means that the request is too large; try to split the requests into smaller chunks. First verify that it works with maybe 10 data sets, then go bigger until you get the 413 error again (see the chunking sketch below). – Joshua Beckers Apr 23 '20 at 06:28
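A sketch of that chunking, assuming a batch size of 500 and the upsertBatch helper from the answer above (both are tunable assumptions, not tested values):

// Process the rows in fixed-size batches so every _all_docs / _bulk_docs
// request stays well under the 413 limit and the rate limit is spread out.
async function importInBatches(db, rows, batchSize = 500) {
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    await upsertBatch(db, batch); // sequential batches keep memory usage flat
  }
}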