
I have a directory in an AzureML notebook containing 300k files and need to list their names. The approach below works but takes 1.5 h to execute:

from os import listdir
from os.path import isfile, join
mypath = "./temp/"
docsOnDisk = [f for f in listdir(mypath) if isfile(join(mypath, f))]

What is the Azure way to quickly list those files? (Both the notebook and this directory are on a FileShare.)

I am also aware that the approach below gives some gain, but it is still not the Azure way to do this.

from os import scandir
docsOnDisk = [f.name for f in scandir(mypath)]  # typically 2-20x faster
RunTheGauntlet
    Do you need a list variable representing each file name, or do you simply want to write out each name? If you need the list variable, what are you doing with it? Are the 4 lines you posted the entirety of the code that takes 1.5 hrs to run? – Dimitri Mar 31 '22 at 20:05
  • 1) I need list of files to save it down to csv and to be used by someone else. 2) Yes listing the files this way took 1.5h. – RunTheGauntlet Apr 04 '22 at 11:18
  • However scandir reduced the time to 15 minutes. – RunTheGauntlet Apr 05 '22 at 07:00
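Putting the scandir suggestion together with the CSV export mentioned in the comments, a self-contained sketch might look as follows (the makedirs call and the docsOnDisk.csv filename are placeholders added only to make this example runnable, not part of the original setup):

```python
import csv
import os
from os import scandir

mypath = "./temp/"
os.makedirs(mypath, exist_ok=True)  # only so this sketch runs standalone

# scandir yields DirEntry objects whose is_file() uses metadata cached
# during the directory read, avoiding a separate stat() call per file,
# which is where the isfile(join(...)) version spends most of its time.
with scandir(mypath) as entries:
    docsOnDisk = [e.name for e in entries if e.is_file()]

# The list can then be written straight to a CSV for sharing:
with open("docsOnDisk.csv", "w", newline="") as f:
    csv.writer(f).writerows([name] for name in docsOnDisk)
```

Using scandir as a context manager (Python 3.6+) also ensures the directory handle is closed promptly, which matters when iterating very large directories.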

1 Answer


Try using the glob module and the filter() built-in instead of a list comprehension.

import glob
from os.path import isfile
mypath = "./temp/*"
docsOnDisk = glob.glob(mypath)
verified_docsOnDisk = list(filter(isfile, docsOnDisk))  # filter() takes the predicate directly

glob only returns paths that actually exist, so there is no need to verify existence with isfile(). Note, however, that the pattern also matches subdirectories, so the isfile() check still matters if ./temp/ contains any. If you want to try the verification anyway, filter() is an alternative to a list comprehension; to skip it, comment out the last line.
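One caveat with the glob approach: it returns paths with the "./temp/" prefix, whereas listdir() returned bare names. A sketch combining glob.iglob (which streams matches lazily rather than building an intermediate 300k-element list) with basename() to recover bare names (the makedirs call is only there to make this example self-contained):

```python
import glob
import os
from os.path import basename, isfile

os.makedirs("./temp/", exist_ok=True)  # only so this sketch runs standalone

# iglob yields matches one at a time instead of materialising a full list;
# basename() strips the "./temp/" prefix so the result holds bare file
# names, as listdir() returned them. The isfile() check is only needed
# when ./temp/ contains subdirectories, since the * pattern matches those too.
verified_docsOnDisk = [basename(p) for p in glob.iglob("./temp/*") if isfile(p)]
```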

YadneshD