
Below is my code to list files with their sizes, sorted by size in descending order.

import os
import time

files_list = []

def Create_Files_Structure(directoryname):
    for path, subdirs, files in os.walk(directoryname, followlinks=False):
        subdirs[:] = [d for d in subdirs if not d[0] == '.']  # skip hidden directories
        try:
            files_list.extend([(os.path.join(path, file), os.path.getsize(os.path.join(path, file))) for file in files])
        except Exception:
            pass  # skip files whose size can't be read
    files_list.sort(key=lambda s: s[1], reverse=True)
    for pair in files_list:
        print(pair)
    print(len(files_list))

start = time.time()
Create_Files_Structure("/home/<username>")
end = time.time()
print(end - start)

This code works, but it is slow when the directory tree holds TBs or PBs of data. Any suggestions to make it return results faster?

Vadim Kotov

2 Answers

  1. To get a feel for how fast you can get, try running and timing du -k on the directory. You probably won't be getting faster than that with Python for a full listing.
  2. If you're running on Python < 3.5, try upgrading or using the scandir package for a nice performance improvement (see the short sketch after the code below).
  3. If you don't really need the whole list of files but can live with, e.g., the largest 1000 files:

Avoid keeping the full list and use heapq.nlargest with a generator; it only ever holds the current top 1000 entries in memory instead of building and sorting millions of tuples.

import heapq
import os

def get_sizes(root):
    for path, dirs, files in os.walk(root):
        dirs[:] = [d for d in dirs if not d.startswith('.')]  # skip hidden directories
        for file in files:
            full_path = os.path.join(path, file)
            try:
                # keeping the size first means no need for a key function,
                # which can affect performance
                yield (os.path.getsize(full_path), full_path)
            except Exception:
                pass

for (size, name) in heapq.nlargest(1000, get_sizes(r"c:\some\path")):
    print(name, size)
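
For point 2, a minimal sketch of the drop-in swap, assuming the scandir backport from PyPI (which provides scandir.walk() as a replacement for os.walk()):

# On Python < 3.5, use the backport's walk(); on 3.5+ os.walk already
# uses scandir under the hood, so nothing changes.
try:
    from scandir import walk   # pip install scandir
except ImportError:
    from os import walk

# get_sizes() above would then call walk(root) instead of os.walk(root).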

EDIT: to get even faster on Windows, use os.scandir directly; the entries it yields already carry the file size, which helps avoid another system call per file.

This means calling os.scandir and recursing yourself instead of relying on os.walk, which doesn't expose that information.

There's a similar working example, the get_tree_size() function, in PEP 471 (the scandir PEP); it can easily be modified to yield names and sizes instead. Each entry's size is accessible via entry.stat(follow_symlinks=False).st_size.
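
As a rough illustration, here is a minimal sketch of such a generator (the name iter_sizes and the hidden-directory filter are additions for this example, not taken from the PEP); it plugs into the same heapq.nlargest call as above:

import heapq
import os

def iter_sizes(root):
    # os.scandir lists a directory and caches stat information gathered
    # during the scan (notably on Windows), so entry.stat() often needs
    # no extra system call
    try:
        entries = list(os.scandir(root))
    except OSError:
        return
    for entry in entries:
        if entry.name.startswith('.'):
            continue  # skip hidden files and directories
        try:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_sizes(entry.path)  # recurse into subdirectory
            elif entry.is_file(follow_symlinks=False):
                yield (entry.stat(follow_symlinks=False).st_size, entry.path)
        except OSError:
            pass  # skip entries that can't be read

for (size, name) in heapq.nlargest(1000, iter_sizes(r"c:\some\path")):
    print(name, size)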

orip
  • Did you mean > Python 3.5? – brandonscript Apr 04 '21 at 19:10
  • @brandonscript in Python >= 3.5 it's built-in, in older versions you would consider the library I linked that made it into the standard library – orip Apr 04 '21 at 20:39
  • Yeah, sorry. In your 2. You use < instead of >=, and the way it was worded made it sound like you could use scandir if you’re on 3.5 or higher. – brandonscript Apr 04 '21 at 20:41
  • Understood. I was answering in the context of the question which was looking for suggestions to make the code faster: upgrading to 3.5+ would make `os.walk` faster thanks to the underlying use of `scandir`, and using scandir directly as an external module would give the same effect for versions before 3.5. Maybe I could have been clearer – orip Apr 04 '21 at 20:47

Nice question! Try this:

import time, os

def create_files_structure_2(directoryname):

    files_list = []
    counter = 0

    for dirpath, _, filenames in os.walk(directoryname):
        for item in filenames:
            file_full_path = os.path.abspath(os.path.join(dirpath, item))
            size = os.path.getsize(file_full_path)
            files_list.append((file_full_path, size))
            counter += 1

    files_list.sort(key=lambda s: s[1], reverse=True)
    for f in files_list:
        print(f)
    print(counter)


start = time.time()
create_files_structure_2("your_target_folder")
end = time.time()
print(end - start)

NOTE: your time is 0.044736385345458984, my time is 0.001501321792602539!

Good Luck ...

DRPK
  • You're not filtering the '.' subdirs from the scan - try timing it with that to compare – orip Nov 14 '17 at 12:38