
A streaming job drops a CSV into blob storage multiple times a day. After a few weeks of collecting data, I will run a Python script to do some machine learning. To set up the training data, I first move all the data within a date range into a single CSV on the virtual machine so training can run over that one file in a single pass.

Using the code below, I find that moving the data from blob storage to the virtual machine with blob_service.get_blob_to_path() takes 25 seconds per file on average, even for small 3 MB files. The appending portion is incredibly fast, taking only milliseconds per file.

Is there a better way to do this? I thought increasing max_connections would help, but I see no performance improvement.

import time

import pandas as pd
from azure.storage.blob import BlockBlobService

start_time = time.time()

blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

# get the full list of blobs in the container (list_blobs returns results in pages)
blobs = []
marker = None
while True:
    batch = blob_service.list_blobs(CONTAINER_NAME, marker=marker)
    blobs.extend(batch)
    marker = batch.next_marker
    if not marker:
        break

for blob in blobs:
    print(time.time() - start_time)
    # blob names look like year/month/day/hour/filename.csv
    split_name = blob.name.split('/')
    blob_date = pd.to_datetime(str(split_name[0]) + '-' + str(split_name[1]) + '-' + str(split_name[2]))
    # s = arg start date, e = arg end date
    if blob_date > s and blob_date <= e:
        print('Appending: ' + blob.name, end='')
        # download the blob to a temporary local file...
        blob_service.get_blob_to_path(CONTAINER_NAME, blob.name,
                                      './outputs/last_blob.csv',
                                      open_mode='wb',
                                      max_connections=6)
        print(' ... adding to training csv ' + str(time.time() - start_time))
        # ...then append its contents to the combined training file
        with open('./outputs/all_training_data.csv', 'ab') as f_out:
            with open('./outputs/last_blob.csv', 'rb') as blob_in:
                for line in blob_in:
                    f_out.write(line)
    else:
        print('** SKIPPING: ' + blob.name)

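One variation I have been wondering about (untested) is skipping the intermediate last_blob.csv and streaming each blob straight into the combined file with get_blob_to_stream, which lives in the same SDK as get_blob_to_path. A rough sketch, assuming the same date filtering as above:

with open('./outputs/all_training_data.csv', 'ab') as f_out:
    for blob in blobs:
        split_name = blob.name.split('/')
        blob_date = pd.to_datetime('-'.join(split_name[:3]))
        if s < blob_date <= e:
            # write the downloaded bytes straight into the open combined file;
            # max_connections=1 keeps the writes sequential for an append-mode stream
            blob_service.get_blob_to_stream(CONTAINER_NAME, blob.name, f_out,
                                            max_connections=1)

I don't expect that alone to fix the 25-second downloads, since the append step is already fast, but it would at least remove the extra disk write.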
Additional notes: This is being done using Azure Machine Learning Workbench as part of my train.py process.

--edit--

The Data Science VM and the storage account are both in South Central US. The DSVM is a standard DS4_v2 (8-core CPU, 28 GB memory). The total size of all the blobs combined for my current test is probably close to 200 MB.

I timed the copy and it goes very quickly; here's some sample output, where the time prints line up with the code up top. The first file takes 13 seconds to download and 0.01 seconds to append, the second takes 6 seconds to download and 0.013 seconds to append, and the third takes 24 seconds to download.

1666.6139023303986
Appending: 2017/10/13/00/b1.csv ... adding to training csv 1679.0256536006927
1679.03680062294
Appending: 2017/10/13/01/b2.csv ... adding to training csv 1685.968115568161
1685.9810137748718
Appending: 2017/10/13/02/b3.csv ... adding to training csv 1709.5959916114807

This is all happening in the Docker container running on the VM. I am not sure where that lands in terms of standard storage vs. premium/SSD. The VM itself has 56 GB of local SSD as part of the DS4_v2 configuration.

##### BEGIN TRAINING PIPELINE
import os

# Create the outputs folder - save any outputs you want managed by AzureML here
os.makedirs('./outputs', exist_ok=True)

I have not tried going the parallel route and would need a bit of guidance on how to tackle that.
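From reading around, something like the following seems like one way to parallelize the downloads (untested sketch using concurrent.futures; the local file naming is just a placeholder, and I'm assuming a single BlockBlobService instance can be shared across threads, otherwise one per worker):

import os
from concurrent.futures import ThreadPoolExecutor

def download(blob_name):
    # placeholder naming: flatten year/month/day/hour/file.csv into one local filename
    local_path = os.path.join('./outputs', blob_name.replace('/', '_'))
    blob_service.get_blob_to_path(CONTAINER_NAME, blob_name, local_path)
    return local_path

# filter to the date range first, then download several files at a time
in_range = [b.name for b in blobs
            if s < pd.to_datetime('-'.join(b.name.split('/')[:3])) <= e]

with ThreadPoolExecutor(max_workers=8) as pool:
    local_files = list(pool.map(download, in_range))  # map preserves input order

# append everything in order once the downloads finish
with open('./outputs/all_training_data.csv', 'ab') as f_out:
    for path in local_files:
        with open(path, 'rb') as blob_in:
            f_out.write(blob_in.read())

Does that look like a sensible starting point, or is there a better pattern for this?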

  • Is your storage account in the same region as your VM (since hopping regions will add latency)? What size VM are you using (which would impact network performance)? Are you including the time it takes to copy line by line, once your file is local? Are you writing to a regular storage disk, vs SSD (premium) disk or SSD temp disk? Have you tried downloading multiple blobs in parallel? Lots of potential bottlenecks. Really not enough detail right now, and it could be so many things slowing you down (I just gave a few things to think about). – David Makogon Oct 19 '17 at 14:21
  • Consider editing your question with more details, so this isn't closed as "too broad." – David Makogon Oct 19 '17 at 14:22
  • Edited to address your questions – user4446237 Oct 19 '17 at 14:32
  • This is weird. We download about five hundred megabytes archive sliced into 20-megabytes chunks sequentially in just about 30 seconds (or even less) on D1_v2. – sharptooth Oct 19 '17 at 16:08

0 Answers