
Here is my problem. I am trying to upload a large CSV file (~14 GB) to Cosmos DB, but I am finding it difficult to maximize the throughput I am paying for. On the Azure portal metrics overview UI, it says that I am using 73 RU/s when I am paying for 16600 RU/s. Right now, I am using pymongo's bulk write function to upload to the db, but I find that any bulk_write length greater than 5 will throw a hard "Request rate is large" exception. Am I doing this wrong? Is there a more efficient way to upload data in this scenario? Internet bandwidth is probably not a problem, because I am uploading from an Azure VM to Cosmos DB.

Structure of how I am uploading in Python now:

import csv

import pymongo

operations = []

for row in csv.reader(csv_file):  # csv_file: the already-opened 14 GB CSV file
    row[id_index_1] = convert_id_to_useful_id(row[id_index_1])

    find_criteria = {
        # find query
    }

    upsert_dict = {
        # row data
    }
    operations.append(pymongo.UpdateOne(find_criteria, upsert_dict, upsert=True))

    if len(operations) > 5:
        results = collection.bulk_write(operations)
        operations = []

# flush whatever is left over once the loop ends
if operations:
    collection.bulk_write(operations)

Any suggestions would be greatly appreciated.

Aaron Arima

4 Answers


Aaron. Yes, as you said in the comment, the Data Migration tool is not supported by the Azure Cosmos DB MongoDB API. You can find the statement below in the official doc.

The Data Migration tool does not currently support Azure Cosmos DB MongoDB API either as a source or as a target. If you want to migrate the data in or out of MongoDB API collections in Azure Cosmos DB, refer to Azure Cosmos DB: How to migrate data for the MongoDB API for instructions. You can still use the Data Migration tool to export data from MongoDB to Azure Cosmos DB SQL API collections for use with the SQL API.

I can offer you a workaround: use Azure Data Factory. Please refer to this doc to configure Cosmos DB as the sink, and to this doc to configure the CSV file in Azure Blob Storage as the source. In the pipeline, you can configure the batch size.


Surely, you could do this programmatically. You didn't miss anything: the error Request rate is large just means you have exceeded the provisioned RU quota. You could raise the RU setting; please refer to this doc.
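
If you stay with pymongo, a minimal sketch of that programmatic approach (my own suggestion, not something from the linked doc) is to catch the throttling error and retry only the rejected operations with a backoff. The helper bulk_write_with_retry below is hypothetical, collection and operations are the same names as in the question's code, and the check for write-error code 16500 assumes that is how the MongoDB API surfaces Request rate is large; verify it against your own error details.

import time

from pymongo.errors import BulkWriteError


def bulk_write_with_retry(collection, operations, max_retries=5):
    """Retry bulk_write when Cosmos DB throttles it (Request rate is large)."""
    for attempt in range(max_retries):
        try:
            return collection.bulk_write(operations, ordered=False)
        except BulkWriteError as exc:
            # Assumption: the MongoDB API reports throttling as write-error code 16500.
            throttled = [e for e in exc.details.get("writeErrors", [])
                         if e.get("code") == 16500]
            if not throttled:
                raise
            # Keep only the operations that were rejected and back off before retrying.
            operations = [operations[e["index"]] for e in throttled]
            time.sleep(2 ** attempt)
    raise RuntimeError("Still throttled after %d retries" % max_retries)

With ordered=False the writes that did go through are not re-sent, so each retry only carries the throttled remainder.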

If you have any concerns, please feel free to let me know.

Jay Gong
  • Hey, thanks for answering! This is very useful, though the file I am uploading is in Azure Storage stored as a file, not as a blob. I had a question about Azure throughput distribution. It seems that I am getting "Request rate is large" because I am trying to write to one partition and am hitting the partition maximum. However, the amount of throughput per partition (I have 14) is 100, and 14*100 is much less than 16600. Do you know how this throughput is distributed across the partitions and how I could increase the RU/s per partition? – Aaron Arima Aug 27 '18 at 22:27
  • @AaronArima Sorry, I don't understand what you mean by "the amount of throughput per each partition (I have 14) is 100". As far as I know, the throughput setting is not divided by partitions. – Jay Gong Aug 28 '18 at 02:02
  • "The value of maximum throughput per partition(t) is configured by Azure Cosmos DB, this value is assigned based on total provisioned throughput and the hardware configuration used. " -- https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data – Aaron Arima Aug 29 '18 at 02:52

I'd take a look at the Cosmos DB: Data Migration Tool. I haven't used this with the MongoDB API, but it is supported. I have used this to move lots of documents from my local machine to Azure with great success, and it will utilize RU/s that are available.

If you need to do this programmatically, I suggest taking a look at the underlying source code for the Data Migration Tool. It is open source; you can find the code here.

Rob Reagan
  • It says it doesn't support the Mongo API: "The Data Migration tool does not currently support Azure Cosmos DB MongoDB API either as a source or as a target. If you want to migrate the data in or out of MongoDB API collections in Azure Cosmos DB, refer to Azure Cosmos DB: How to migrate data for the MongoDB API for instructions. You can still use the Data Migration tool to export data from MongoDB to Azure Cosmos DB SQL API collections for use with the SQL API." – Aaron Arima Aug 25 '18 at 19:14
  • But thanks, I think this led me in the right direction! – Aaron Arima Aug 25 '18 at 19:28

I was able to improve the upload speed. I noticed that each physical partition had a throughput limit (and, for some reason, the number of physical partitions times the per-partition throughput still did not add up to the total throughput provisioned for the collection), so what I did was split the data by partition and then create a separate upload process for each partition key. This increased my upload speed by a factor of (# of physical partitions).
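
A minimal sketch of what this looks like, assuming the CSV rows have already been parsed into dicts (the rows_from_csv iterable and all of the UPPER_CASE placeholders below are hypothetical), with the worker count capped at the 14 physical partitions mentioned above:

from collections import defaultdict
from multiprocessing import Pool

import pymongo

# Hypothetical placeholders -- fill in with real values.
CONNECTION_STRING = "<cosmos-db-mongodb-connection-string>"
DB_NAME = "<database>"
COLLECTION_NAME = "<collection>"
PARTITION_KEY_FIELD = "<partition key field>"
BATCH_SIZE = 5  # the batch size that worked without throttling in the question


def upload_partition(args):
    """Upsert one partition key's rows through its own process and connection."""
    _partition_key, rows = args
    client = pymongo.MongoClient(CONNECTION_STRING)
    collection = client[DB_NAME][COLLECTION_NAME]
    operations = [
        pymongo.UpdateOne(
            # Hypothetical filter: a unique _id per row, plus the partition key.
            {"_id": row["_id"], PARTITION_KEY_FIELD: row[PARTITION_KEY_FIELD]},
            {"$set": {k: v for k, v in row.items() if k != "_id"}},
            upsert=True)
        for row in rows
    ]
    for i in range(0, len(operations), BATCH_SIZE):
        collection.bulk_write(operations[i:i + BATCH_SIZE], ordered=False)


if __name__ == "__main__":
    # Group the CSV rows (parsed into dicts) by partition key value.
    groups = defaultdict(list)
    for row in rows_from_csv:  # hypothetical iterable of row dicts
        groups[row[PARTITION_KEY_FIELD]].append(row)

    # One worker per group, capped at roughly the number of physical partitions (14 here).
    with Pool(processes=min(len(groups), 14)) as pool:
        pool.map(upload_partition, list(groups.items()))

For a 14 GB file you would probably stream each partition's rows into its own intermediate file (or chunk each group) rather than holding everything in memory, but the parallel structure is the same.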

Aaron Arima

I have used the Cosmos DB Data Migration tool, which is awesome for sending data to Cosmos DB without much configuration. I assume it can even handle CSV files as large as 14 GB.

Below is the data we transferred:

  • 10000 records transferred | throughput 4000 | 500 parallel requests | 25 seconds
  • 10000 records transferred | throughput 4000 | 100 parallel requests | 90 seconds
  • 10000 records transferred | throughput 350 | 10 parallel requests | 300 seconds

  • I am not sure if this is what the OP was looking for; perhaps this is more of a comment. – Jay Oct 05 '20 at 16:25
  • Yeah, I agree, I haven't gone through the question properly. However, it gives some understanding based on the comment. @Jay – Musham Ajay Oct 30 '20 at 07:00