
I have some files in Azure Data Lake and I need to count how many lines they have to make sure they are complete. What would be the best way to do it?

I am using Python:

from azure.datalake.store import core, lib
adl_creds = lib.auth(tenant_id='fake_value', client_secret='fake_another value', client_id='fake key', resource='https://my_web.azure.net/')
adl = core.AzureDLFileSystem(adl_creds, store_name='fake account')

file_path_in_azure = "my/path/to/file.txt"
counter = 0
if adl.exists(file_path_in_azure):
    # block sizes in bytes: 1 MB = 1048576, 5 MB = 5242880, 100 MB = 104857600, 500 MB = 524288000
    with adl.open(file_path_in_azure, mode="rb", blocksize=5242880) as f:
        # I tried counting with a comprehension, but memory kept increasing since it builds a list of 1s [1, 1, 1, ...] and then sums them:
        # counter1 = sum(1 for line in f)
        for line in f:
            counter += 1

print(counter)

This works, but it takes hours for files that are 1 or 2 gigabytes. Shouldn't this be faster? Might there be a better way?

Mihai Chelaru
pelos

3 Answers


Do you need to count lines at all? Maybe it is enough to check the size of the file. You have AzureDLFileSystem.stat to get the file size; if you know the average line length, you can calculate the expected line count.
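A minimal sketch of that idea (not from the answer itself), assuming adl.stat(path), like adl.info(path), returns a dict whose 'length' field is the size in bytes, and assuming you already know the average line length (AVERAGE_LINE_BYTES below is a made-up value):

file_path_in_azure = "my/path/to/file.txt"
AVERAGE_LINE_BYTES = 120  # assumption: average bytes per line, known from how the file is written

if adl.exists(file_path_in_azure):
    size_in_bytes = adl.stat(file_path_in_azure)['length']  # total file size in bytes
    estimated_lines = size_in_bytes // AVERAGE_LINE_BYTES
    print("expected lines: about", estimated_lines)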

Tomasz Swider
  • each line might be a different size. If we were writing the file we would pad each line to a fixed number of characters, so you could divide the file size by the bytes in one line =( but in this case each line might be different. – pelos Dec 19 '18 at 16:51

You could try:

counter = 0
for file in adl.walk('path/to/folder'):
    counter += len(adl.cat(file).decode().splitlines())

I'm not sure if this is actually faster, but it reads each whole file in one call (like the Unix cat built-in), which might be quicker than explicit line-by-line I/O.

EDIT: The one pitfall of this method is that it breaks down when a file's size exceeds the RAM of the machine you run it on, since cat pulls the entire contents into memory.
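If memory is the constraint, a possible middle ground (just a sketch, assuming the file object returned by adl.open supports read(n) like an ordinary file-like object) is to read fixed-size chunks and count newline bytes, so only one chunk is ever held in memory:

def count_lines_chunked(adl, path, blocksize=4 * 1024 * 1024):
    # Count b'\n' occurrences chunk by chunk; memory use stays around one blocksize.
    newlines = 0
    with adl.open(path, mode="rb", blocksize=blocksize) as f:
        while True:
            chunk = f.read(blocksize)
            if not chunk:  # empty bytes means end of file
                break
            newlines += chunk.count(b"\n")
    return newlines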

C.Nivs
  • the files are big, 1 or 2 gigs, so I can't put all the text into memory =( – pelos Dec 19 '18 at 19:57
  • I'd think that 1-2GB would be ok, even for a relatively small machine. Otherwise, you're kind of stuck with iterating through each file, unfortunately. I've struggled with some of the pitfalls of I/O from adl myself – C.Nivs Dec 19 '18 at 20:43

The only faster way I found was to actually download the file locally to where the script is running with

 adl.get(remote_file, locally)

and then count line by line without loading the whole file into memory. Downloading 500 MB takes around 30 seconds and reading 1 million lines around 4 seconds =)
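A rough sketch of that approach, assuming adl.get(remote_path, local_path) is the download call (in azure-datalake-store, get downloads and put uploads) and that the file fits on local disk; the local file names below are just placeholders:

import os
import tempfile

remote_file = "my/path/to/file.txt"
local_file = os.path.join(tempfile.gettempdir(), "file.txt")  # hypothetical local target path

adl.get(remote_file, local_file)  # download once; this is the ~30 second part for ~500 MB

counter = 0
with open(local_file, "rb") as f:  # iterate line by line without loading the whole file
    for _ in f:
        counter += 1
print(counter)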

pelos