
Suppose I have the following simplified file structure:

main_folder
    |__ foo.json
    |
    |__ sub_folder
          |__bar.json

I have two copies of main_folder, e.g. main_folder_v1 and main_folder_v2.

I want to compare both versions and get the names of all files that differ (for example, get "foo.json" if its content was updated in main_folder_v2).

I use the code below:

import filecmp

comparison_result = filecmp.dircmp(main_folder_v1, main_folder_v2)
files_that_differs = comparison_result.diff_files

The problem is that I get ["foo.json"] if it was updated in main_folder_v2, but I never get ["bar.json"], as it seems that the files in sub_folder are not compared.

Is there a way to compare folders recursively using filecmp and get the names of the files that differ, or is os.walk() the only solution?

Andersson
  • What's wrong with `os.walk`? – SuperStew May 03 '18 at 14:47
  • I agree with @SuperStew. I think `os.walk` along with `set.symmetric_difference` would handle this – Cory Kramer May 03 '18 at 14:48
  • @CoryKramer, SuperStew, There is nothing wrong with `os.walk`. I was just looking for a solution as simple as `comparison_result.diff_files`. In case there is no such solution I will use `os.walk`... – Andersson May 03 '18 at 14:55

1 Answer


[Python]: filecmp - File and Directory Comparisons supports recursive traversal via dircmp.subdirs, so there is no need for os.walk (or any similar function).

code.py:

import sys
import filecmp
import os


main_folder_v1 = "dir_v1"
main_folder_v2 = "dir_v2"

ROOT_DIR_MARKER = ""


def traverse_dircmp(dircmp_obj, dir_name=ROOT_DIR_MARKER):
    for item in dircmp_obj.diff_files:
        yield os.path.join(dir_name, item)
    for subdir_name in dircmp_obj.subdirs:
        yield from traverse_dircmp(dircmp_obj.subdirs[subdir_name], dir_name=os.path.join(dir_name, subdir_name))
        #for item in traverse_dircmp(dircmp_obj.subdirs[subdir_name], dir_name=os.path.join(dir_name, subdir_name)):
        #    yield item


def traverse_dircmp_list(dircmp_obj, dir_name=ROOT_DIR_MARKER):
    ret = [os.path.join(dir_name, item) for item in dircmp_obj.diff_files]
    for subdir_name in dircmp_obj.subdirs:
        ret.extend(traverse_dircmp_list(dircmp_obj.subdirs[subdir_name], dir_name=os.path.join(dir_name, subdir_name)))
    return ret


def main():
    comparison_object = filecmp.dircmp(main_folder_v1, main_folder_v2)

    comparison_result = traverse_dircmp(comparison_object)
    print("{:s}: {:}".format("Different files (gen)", list(comparison_result)))

    comparison_result_list = traverse_dircmp_list(comparison_object)
    print("{:s}: {:}".format("Different files (list)", comparison_result_list))


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Output (for a dir structure similar to yours):

(py35x64_test) e:\Work\Dev\StackOverflow\q050157870>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

Different files (gen): ['foo.json', 'subdir00\\bar.json', 'subdir00\\subdir001\\x.json']
Different files (list): ['foo.json', 'subdir00\\bar.json', 'subdir00\\subdir001\\x.json']

@EDIT0:

  • Modified the traverse_dircmp function to return the list of files, instead of printing them, as requested in one of the comments

@EDIT1:

  • Added a generator version (as a personal exercise). This is the newer, preferred style, and it avoids building the whole list in memory for huge directories. It requires Python 3.3 or higher; on older versions, replace the yield from statement with the two commented lines (for and yield) below it
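
One caveat worth keeping in mind: dircmp compares files shallowly by default (file type, size and modification time), so two files with identical os.stat signatures are treated as equal without their contents being read. Below is a minimal sketch, assuming a byte-for-byte comparison is wanted, that walks dircmp.subdirs the same way as the code above but re-checks each directory's common_files with filecmp.cmpfiles(..., shallow=False). The names deep_diff_files and build_and_compare_demo are hypothetical helpers introduced only for illustration, and the demo tree mirrors the layout from the question.

```python
import filecmp
import os
import tempfile


def deep_diff_files(dircmp_obj, dir_name=""):
    """Yield relative paths of common files whose contents differ (hypothetical helper)."""
    # cmpfiles returns (match, mismatch, errors); shallow=False forces a content comparison.
    # Files that could not be compared (errors) are skipped here, as diff_files does.
    _, mismatch, _ = filecmp.cmpfiles(
        dircmp_obj.left, dircmp_obj.right, dircmp_obj.common_files, shallow=False
    )
    for name in sorted(mismatch):
        yield os.path.join(dir_name, name)
    # Recurse into the directories present on both sides.
    for subdir_name, sub_dircmp in sorted(dircmp_obj.subdirs.items()):
        yield from deep_diff_files(sub_dircmp, os.path.join(dir_name, subdir_name))


def build_and_compare_demo():
    """Build two small trees in a temp dir and return the differing files (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        v1 = os.path.join(tmp, "main_folder_v1")
        v2 = os.path.join(tmp, "main_folder_v2")
        for root, value in ((v1, 1), (v2, 2)):
            os.makedirs(os.path.join(root, "sub_folder"))
            # Same size in both versions, different content.
            for rel in ("foo.json", os.path.join("sub_folder", "bar.json")):
                with open(os.path.join(root, rel), "w") as fh:
                    fh.write('{"a": %d}' % value)
            # Identical in both versions, so it must not be reported.
            with open(os.path.join(root, "same.json"), "w") as fh:
                fh.write("{}")
        return list(deep_diff_files(filecmp.dircmp(v1, v2)))


if __name__ == "__main__":
    print(build_and_compare_demo())
```

Reading every common file is slower than the default shallow check, so this variant is probably only worth it when timestamps and sizes cannot be trusted to reveal a change.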
CristiFati
  • This almost did the trick (+1), but instead of printing the file names from each subfolder, I need to get a complete list of all the file names that differ. Can `traverse_dircmp()` be modified to append the items from each subfolder to a complete list and return it? – Andersson May 03 '18 at 16:02
  • I tried to define `my_list = []` outside the function and update `global my_list` by replacing `print("{:s} - {:}".format(dir_name, dircmp_obj.diff_files))` with `my_list.extend(dircmp_obj.diff_files)`, but I guess it's not the best idea...:) – Andersson May 03 '18 at 16:11
  • It was working before the last update. I can't try the generator solution right now, but I'm sure it will work too. Thanks – Andersson May 03 '18 at 16:50