A few observations: you haven't mentioned how long your tests actually took, which would have been useful for comparison.
Usually when writing to table storage you would choose a partition key that is well distributed. In your example, however, you are using the same partition key ('PARTITION1') for every entity. In that case you can use the Cosmos table storage SDK to batch up your entities and load them in one go. Note that batching in the Cosmos table storage SDK only works if all entities in the batch share the same partition key.
The documentation states that a single batch can hold up to 100 entities or 4 MB, whichever limit is reached first.
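As an aside, if your real data spans multiple partition keys, you would need to group entities by partition key before batching. Here is a minimal sketch of that idea (not part of my test below; the helper name commit_in_batches and the entities list are hypothetical placeholders):

from collections import defaultdict
from azure.cosmosdb.table import TableService, TableBatch

def commit_in_batches(service, table_name, entities, batch_size=100):
    # A batch may only contain entities with the same PartitionKey,
    # so group the entities by partition key first.
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["PartitionKey"]].append(entity)
    for group in groups.values():
        # Stay under the 100-entity (or 4 MB) batch limit by chunking.
        for start in range(0, len(group), batch_size):
            batch = TableBatch()
            for entity in group[start:start + batch_size]:
                batch.insert_or_merge_entity(entity)
            service.commit_batch(table_name, batch)

In my test below this grouping is unnecessary, since every entity shares the same partition key.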
I ran a quick test using the Cosmos table Python SDK, batching 50 entities at a time. I was able to load 1,000 entities in around 35 seconds, and a 10,000 entity load took 4 minutes 55 seconds. I'm not sure whether that suits your requirements or you were expecting something faster. Also note that my example entity is slightly larger in size than yours. Below is my code, which looks fairly similar to yours; it should be fairly easy to put into an Azure Function.
# USING THE COSMOS TABLE STORAGE API
from azure.cosmosdb.table import TableService, TableBatch
from datetime import datetime, timedelta
from random import randrange
import random
import names

acc_name = '<storageacct_name>'
acc_key = 'xxxxxxxxxxxxxxxxx'
table_name = 'xxxxxxx'
Number_of_docs = 10000
batch_size = 50

d1 = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')
d2 = datetime.strptime('1/1/2009 4:50 AM', '%m/%d/%Y %I:%M %p')

service = TableService(account_name=acc_name, account_key=acc_key)

def random_date(start, end):
    """Return a random datetime between two datetime objects."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

def create_entity(id):
    # Every entity uses the same partition key, so all of them are
    # eligible to share a batch.
    return {"PartitionKey": "PARTITION1",
            "RowKey": str(id),
            "employee_id": str(id),
            "first_name": names.get_first_name(),
            "last_name": names.get_last_name(),
            "regn_no": "TEST00000000" + str(id),
            "start_date": random_date(d1, d2),
            "salary": random.randint(12000, 2000000),
            "inserted_at": str(datetime.now())}

starttime = datetime.utcnow()
print("Starting ingestion: ", starttime.strftime("%Y-%m-%d %H:%M:%S.%f"))

batch_no = 0
batch = TableBatch()
for i in range(1, Number_of_docs + 1):
    batch.insert_or_merge_entity(create_entity(i))
    batch_no += 1
    if batch_no == batch_size:
        service.commit_batch(table_name, batch)
        batch_no = 0
        batch = TableBatch()

# Commit any remaining entities in the final partial batch.
if batch_no > 0:
    service.commit_batch(table_name, batch)

endtime = datetime.utcnow()
print("\nrun_sample done :" + endtime.strftime("%Y-%m-%d %H:%M:%S.%f"))
print("Time taken :" + str(endtime - starttime))
print("Number of messages :" + str(Number_of_docs))
EDIT: Apologies, I hadn't realized that the Cosmos table SDK is being replaced by the new azure-data-tables SDK, per an article released 3 days ago. So I have rewritten this code using the new table storage SDK and tested it again. The results are in fact better: 3 minutes 55 seconds for 10,000 entities. You can find more examples of how to use this new SDK HERE.
# USING THE NEW TABLE STORAGE API
from azure.data.tables import TableClient
from datetime import datetime, timedelta
from random import randrange
import random
import names

conn = 'xxxxxxxxxxxxxxxxxxxxx;EndpointSuffix=core.windows.net'
tablename = 'mytable'
table_client = TableClient.from_connection_string(conn_str=conn, table_name=tablename)
Number_of_docs = 10000
batch_size = 50

d1 = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')
d2 = datetime.strptime('1/1/2009 4:50 AM', '%m/%d/%Y %I:%M %p')

def random_date(start, end):
    """Return a random datetime between two datetime objects."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

def create_entity(id):
    # Every entity uses the same partition key, so all of them are
    # eligible to share a transaction.
    return {"PartitionKey": "PARTITION1",
            "RowKey": str(id),
            "employee_id": str(id),
            "first_name": names.get_first_name(),
            "last_name": names.get_last_name(),
            "regn_no": "TEST00000000" + str(id),
            "start_date": random_date(d1, d2),
            "salary": random.randint(12000, 2000000),
            "inserted_at": str(datetime.now())}

starttime = datetime.utcnow()
print("Starting ingestion: ", starttime.strftime("%Y-%m-%d %H:%M:%S.%f"))

batch_no = 0
operations = []
for i in range(1, Number_of_docs + 1):
    operations.append(("upsert", create_entity(i)))
    batch_no += 1
    if batch_no == batch_size:
        table_client.submit_transaction(operations)
        batch_no = 0
        operations = []

# Submit any remaining operations in the final partial transaction.
if batch_no > 0:
    table_client.submit_transaction(operations)

endtime = datetime.utcnow()
print("\nrun_sample done :" + endtime.strftime("%Y-%m-%d %H:%M:%S.%f"))
print("Time taken :" + str(endtime - starttime))
print("Number of messages :" + str(Number_of_docs))