
I am trying to read a large (~1.5 GB) .txt file from an Azure blob in Python, which raises a MemoryError. Is there a way to read this file efficiently?

Below is the code that I am trying to run:

from azure.storage.blob import BlockBlobService
import pandas as pd
from io import StringIO
import time

STORAGEACCOUNTNAME = '*********'
STORAGEACCOUNTKEY = "********"

CONTAINERNAME = '******'
BLOBNAME = 'path/to/blob'

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

start = time.time()
# download the whole blob into memory as a single string
blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME).content

df = pd.read_csv(StringIO(blobstring))
end = time.time()

print("Time taken = ", end - start)

Below are the last few lines of the error:

---> 16 blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
     17 
     18 #df = pd.read_csv(StringIO(blobstring))

~/anaconda3_420/lib/python3.5/site-packages/azure/storage/blob/baseblobservice.py in get_blob_to_text(self, container_name, blob_name, encoding, snapshot, start_range, end_range, validate_content, progress_callback, max_connections, lease_id, if_modified_since, if_unmodified_since, if_match, if_none_match, timeout)
   2378                                       if_none_match,
   2379                                       timeout)
-> 2380         blob.content = blob.content.decode(encoding)
   2381         return blob
   2382 

MemoryError:

How can I read a file of ~1.5 GB from a blob container in Python? I would also like the runtime to be as short as possible.

Udara Abeythilake
Shubham Singh
  • looks like you've run out of ram, what I would do in your situation is read in the top 50 rows and see your `dtypes` and `columns` you can then choose your columns and `dtypes` which will make loading quicker. If you don't specify your `dtypes` pandas will guess which is very inefficient. – Umar.H Jun 23 '19 at 15:11
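
As a rough illustration of Umar.H's suggestion (a sketch only; the local file path, column names, and dtypes below are hypothetical):

import pandas as pd

# Peek at the first 50 rows to inspect the columns and the dtypes pandas infers
sample = pd.read_csv("large_file.csv", nrows=50)   # hypothetical local path
print(sample.dtypes)

# Then load only the columns you need, with explicit (smaller) dtypes,
# so pandas does not have to guess types while parsing ~1.5 GB of text
df = pd.read_csv(
    "large_file.csv",
    usecols=["col_a", "col_b"],                     # hypothetical column names
    dtype={"col_a": "int32", "col_b": "float32"},   # hypothetical dtypes
)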

1 Answer


Assuming there is enough memory on your machine, and according to the pandas.read_csv API reference below, you can read the CSV blob content directly into a pandas dataframe from the blob URL with a SAS token.

(screenshot of the pandas.read_csv API reference: filepath_or_buffer accepts any valid string path, including a URL)

Here is my sample code as a reference for you.

from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

import pandas as pd

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'

url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"

service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate the sas token for your csv blob
token = service.generate_blob_shared_access_signature(
    container_name,
    blob_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)

# Directly read the csv blob content into dataframe by the url with sas token
df = pd.read_csv(f"{url}?{token}")
print(df)

I think this avoids copying the content in memory several times, which is what happens when you read the whole text content into a string and then convert it to a file-like buffer.
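
If the dataframe still does not fit comfortably in memory, the same URL can be read in chunks instead. This is only a sketch, reusing the url and token variables from the snippet above; the chunk size is an arbitrary choice.

import pandas as pd

# url and token are the same variables built in the snippet above
blob_url_with_sas = f"{url}?{token}"

# Iterate over the blob in fixed-size row chunks instead of
# materializing one huge dataframe in memory
for chunk in pd.read_csv(blob_url_with_sas, chunksize=1000000):
    # process each chunk (filter, aggregate, write out, ...) and discard it
    print(len(chunk))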

Hope it helps.

Peter Pan