I have some files in Azure Data Lake and I need to count the lines in each one to verify that it is complete. What would be the best way to do it?
I am using Python:
from azure.datalake.store import core, lib
adl_creds = lib.auth(tenant_id='fake_value', client_secret='fake_another value', client_id='fake key', resource='https://my_web.azure.net/')
adl = core.AzureDLFileSystem(adl_creds, store_name='fake account')
file_path_in_azure = "my/path/to/file.txt"
if adl.exists(file_path_in_azure):
    # Block sizes in bytes: 1 MB = 1048576, 5 MB = 5242880, 100 MB = 104857600, 500 MB = 524288000
    counter = 0
    with adl.open(file_path_in_azure, mode="rb", blocksize=5242880) as f:
        # I tried sum(1 for line in f), but memory usage increased; it seemed to
        # build a list of ones like [1, 1, 1, ...] and then sum it.
        # counter1 = sum(1 for line in f)
        for line in f:
            counter += 1
    print(counter)
This works, but it takes hours for files that are 1 or 2 gigabytes. Shouldn't this be faster? Might there be a better way?
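One alternative I am considering, as a rough sketch only: read the file in fixed-size chunks and count newline bytes instead of iterating line by line. The helper name `count_lines` and the 4 MB `chunk_size` are my own placeholders, not part of the library, and I am assuming the object returned by `adl.open(..., mode="rb")` supports `read(n)` like an ordinary binary file. I have not measured whether this is actually faster against Data Lake.

```python
import io


def count_lines(fileobj, chunk_size=4 * 1024 * 1024):
    """Count lines in a binary file-like object by counting newline bytes.

    chunk_size is an arbitrary placeholder value; tune it for your data.
    """
    newlines = 0
    last_byte = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of file
            break
        newlines += chunk.count(b"\n")
        last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line,
    # which matches what "for line in f" would report.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines


# Local check with an in-memory file, no Azure connection needed.
sample = io.BytesIO(b"a\nb\nc")
print(count_lines(sample, chunk_size=2))  # prints 3

# Hypothetical use with the file system object from the code above:
# with adl.open(file_path_in_azure, mode="rb") as f:
#     print(count_lines(f))
```

Only one chunk is held in memory at a time, so memory use should stay close to `chunk_size` regardless of file size.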