Python: Compare Filenames in Folder

Question

Filename1: Data_A_2015-07-29_16-25-55-313.txt

Filename2: Data_B_2015-07-29_16-25-55-313.txt

I need to compare all files in the folder to make sure that for every timestamp there is exactly ONE A file and ONE B file.

The seconds and milliseconds parts of the filename are not always the same in both files, so what counts is the date plus %H-%M: two files for every minute is what I'm looking for.

(e.g.: Data_A_2015-07-29_16-25-55-313.txt and Data_B_2015-07-29_16-25-54-200.txt belong together)

I tried the following code:

for root, dirs, files in os.walk(source):
    for a_items in files:
        if "A" in a_items:
            A_ALL_List.append(a_items)            # One List with all A Files
            a_1 = a_items.split('_')[1]           # A Part
            a_2 = a_items.split('_')[2]           # Date Part
            a_3 = a_items.split('_')[3]           # TimeStamp %H%M%S%SS incl. .txt
            a_4 = a_3.rsplit('.txt', 1)[0]        # TimeStamp %H%M%S%SS excl. .txt
            a_5 = a_4.rsplit('-', 1)[0]           # TimeStamp %H%M%S
            a_6 = a_5.rsplit('-', 1)[0]           # TimeStamp %H%M
            a_lvl1 = a_1 + '_' + a_2 + '_' + a_3  # A_Date_FULLTimeStamp.txt
            A_Lvl1.append(a_lvl1)                 # A_Date_TimeStamp.txt LIST
            a_lvl2 = a_lvl1.rsplit('.txt', 1)[0]  # Split .txt
            A_Lvl2.append(a_lvl2)                 # A_Date_TimeStamp LIST
            a_lvl3 = a_1 + '_' + a_2 + '_' + a_5  # A_Date_(%H%M%S)TimeStamp
            A_Lvl3.append(a_lvl3)                 # A_Date_(%H%M%S)TimeStamp LIST
            a_lvl4 = a_2 + '_' + a_4              # Date_FULLTimeStamp
            A_Lvl4.append(a_lvl4)                 # Date_FULLTimeStamp LIST
            a_lvl5 = a_2 + '_' + a_5              # Date_(%H%M%S)TimeStamp
            A_Lvl5.append(a_lvl5)                 # Date_(%H%M%S)TimeStamp LIST
            a_lvl6 = a_2 + '_' + a_6              # Date_(%H%M)TimeStamp
            A_Lvl6.append(a_lvl6)                 # Date_(%H%M)TimeStamp LIST
    for b_items in files:                         # Did the same for B now
        if "B" in b_items:
            B_All_List.append(b_items)

That way I got lists for both filenames containing only the parts I want to compare. For example, comparing the lists A_Lvl6 and B_Lvl6 would compare only the date part plus the hours and minutes of the timestamp.

I figured out, that there are more B Files than A Files so I moved on:

for Difference in B_Lvl6:            # Data in B
    if Difference not in A_Lvl6:     # Not in A
        DiffList.append(Difference)

That way I got an output of Data where I had no A Files but B Files --> DiffList

Now I would like to look for the corresponding B files from that DiffList (since there are no matching A files) and move those B files into another folder, so that the main folder contains only A and B files with matching timestamps (%H%M).

My question (finally):

How can I manage the last part, where I want to get rid of all A or B files with no timestamp partner?
Is my method a proper way to tackle a problem like this or is it completely insane? I've been using Python for 1.5 weeks now so any suggestions on packages and tutorials would be welcome.

Solution I used:

source='/tmp'

import os
import re
import datetime as dt

pat=re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

# != matches the modified regex described in the comments below, which
# captures longer IDs (e.g. 'HRC06991') instead of 'A'/'B'; with the (A|B)
# pattern shown here, the comparison needs to be ==.
def test_list(l):
    return (len(l)==2 and {t[1] for t in l}!=set('AB'))

def round_time(dto, round_to=60):
    seconds = (dto - dto.min).seconds
    rounding = (seconds-round_to/2) // round_to * round_to
    return dto + dt.timedelta(0,rounding-seconds,-dto.microsecond)

fnames={}
for fn in os.listdir(source):
    p=os.path.join(source, fn)
    if os.path.isfile(p):
        m=pat.search(fn)
        if m:
            d=round_time(dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f'), round_to=60)
            fnames.setdefault(str(d), []).append((p, m.group(1)))

for k, v in [(k, v) for k, v in fnames.items() if not test_list(v)]:
    for fn in v:
        print(fn[0])

My personal choice would be using a regular expression that can extract the meaningful parts of the filename. After this feeding the data into the right data structure makes the diff easy. For example a dictionary where the key is the date and the associated value is a list or set of A/B or whatever values. — pasztorpisti, Sep 06 '15 at 20:46
So I would create a dictionary for A and B files? Sounds good, I will look into dictionaries — Peter S, Sep 06 '15 at 20:56
I know that using regular expressions may sound a bit overkill for some but in such cases I prefer doing exact matches instead of splitting "something". This way it's easier to handle several different filename patterns and the detection of unhandled patterns is easy. With named regex groups the code that handles the useful parts of the pattern is much more readable, you query groups by name instead of indexes. — pasztorpisti, Sep 06 '15 at 21:29

pasztorpisti · Answer 1 · 2015-09-06T21:45:54.300

I think simply ignoring the second and the millisecond part isn't a good idea. It can happen that one of your files has 01:01:59:999 and another one has 01:02:00:000. The difference is only one millisecond, but it affects the minute part too. A better solution would be parsing the datetimes and calculating the timedelta between them. But let's go with the simple stupid version. I thought something like this could do the job. Tailor it to your needs if it isn't exactly what you need:

import os
import re

pattern = re.compile(r'^Data_(?P<filetype>A|B)_(?P<datetime>\d\d\d\d\-\d\d\-\d\d_\d\d\-\d\d)\-\d\d\-\d\d\d\.txt$')

def diff_dir(dir, files):
    a_set, b_set = {}, {}
    sets  = {'A': a_set, 'B': b_set}
    for file in files:
        path = os.path.join(dir, file)
        match = pattern.match(file)
        if match:
            sets[match.group('filetype')][match.group('datetime')] = path
        else:
            print("Filename doesn't match our pattern: " + path)
    a_datetime_set, b_datetime_set = set(a_set.keys()), set(b_set.keys())
    a_only_datetimes = a_datetime_set - b_datetime_set
    b_only_datetimes = b_datetime_set - a_datetime_set
    for dt in a_only_datetimes:
        print(a_set[dt])
    for dt in b_only_datetimes:
        print(b_set[dt])

def diff_dir_recursively(rootdir):
    for dir, subdirs, files in os.walk(rootdir):
        diff_dir(dir, files)

if __name__ == '__main__':
    # use your root directory here
    rootdir = os.path.join(os.path.dirname(__file__), 'dir')
    diff_dir_recursively(rootdir)
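
The "better solution" mentioned at the top of this answer (parsing the datetimes and comparing their timedelta) might look roughly like the sketch below. The names `parse_name` and `pair_by_tolerance` and the 30-second `tolerance` are illustrative assumptions, not part of the code above, and the greedy matching is only one possible strategy.

```python
import datetime as dt
import re

# Same filename layout as above, but the full timestamp is captured.
PATTERN = re.compile(
    r'^Data_(?P<filetype>A|B)_'
    r'(?P<stamp>\d{4}-\d\d-\d\d_\d\d-\d\d-\d\d-\d{3})\.txt$')


def parse_name(name):
    """Return (filetype, datetime) for a matching filename, else None."""
    match = PATTERN.match(name)
    if not match:
        return None
    stamp = dt.datetime.strptime(match.group('stamp'), '%Y-%m-%d_%H-%M-%S-%f')
    return match.group('filetype'), stamp


def pair_by_tolerance(names, tolerance=dt.timedelta(seconds=30)):
    """Greedily pair each A file with the closest unused B file.

    Returns (pairs, unpaired). Files further apart than `tolerance`
    (an assumed value) stay unpaired.
    """
    parsed = [(name, parse_name(name)) for name in names]
    a_files = sorted((p[1], n) for n, p in parsed if p and p[0] == 'A')
    b_files = sorted((p[1], n) for n, p in parsed if p and p[0] == 'B')
    pairs, unpaired = [], []
    for a_time, a_name in a_files:
        # Closest remaining B file by absolute time difference.
        best = min(b_files, key=lambda b: abs(b[0] - a_time), default=None)
        if best is not None and abs(best[0] - a_time) <= tolerance:
            pairs.append((a_name, best[1]))
            b_files.remove(best)
        else:
            unpaired.append(a_name)
    unpaired.extend(name for _, name in b_files)
    return pairs, unpaired
```

With this approach a file stamped 16-25-59-999 and one stamped 16-26-00-000 end up in the same pair even though their minute parts differ.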

Thanks, but as far as I'm informed the measurements are all logged in the same minute, so I won't lose files because of a 01:01:59:999 and 01:02:00:000 scenario ... I hope :-) Thanks for your code, I'll try it — Peter S, Sep 07 '15 at 08:55

CivFan · Answer 2 · 2015-09-06T21:38:46.947

I wanted to post a partial answer to point out how you can assign the results of the split to names, and give them meaningful names. This usually makes solving the problem a little easier.

def match_files(files):
    result = {}

    for filename in files:
        data, letter, date, time_txt = filename.split('_')
        time, ext = time_txt.split('.')
        hour, minute, sec, ms = time.split('-')

        key = date + '_' + hour + '-' + minute

        # Initialize dictionary if it doesn't already exist.
        if key not in result:
            result[key] = {}

        result[key][letter] = filename


    return result



filename1 = 'Data_A_2015-07-29_16-25-55-313.txt'
filename2 = 'Data_B_2015-07-29_16-25-55-313.txt'

file_list = [filename1, filename2]


match_files(file_list)

Output:

In [135]: match_files(file_list)
Out[135]: 
{'2015-07-29_16-25': {'A': 'Data_A_2015-07-29_16-25-55-313.txt',
  'B': 'Data_B_2015-07-29_16-25-55-313.txt'}}
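
To get from this dictionary to the files that lack a partner, one option is to keep every entry that does not contain both letters. A minimal sketch, assuming a `result` dictionary shaped like the output above; the helper name `unpaired_files` and the second sample entry are made up for illustration:

```python
def unpaired_files(result, letters=('A', 'B')):
    """Return filenames from minute buckets that lack one of the letters."""
    lonely = []
    for key in sorted(result):
        by_letter = result[key]
        if not all(letter in by_letter for letter in letters):
            lonely.extend(by_letter[letter] for letter in sorted(by_letter))
    return lonely


result = {
    '2015-07-29_16-25': {'A': 'Data_A_2015-07-29_16-25-55-313.txt',
                         'B': 'Data_B_2015-07-29_16-25-54-200.txt'},
    '2015-07-29_16-27': {'A': 'Data_A_2015-07-29_16-27-54-201.txt'},
}
print(unpaired_files(result))  # ['Data_A_2015-07-29_16-27-54-201.txt']
```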

dawg · Accepted Answer · 2015-09-07T17:00:05.867

Given these five file names:

$ ls Data*
Data_A_2015-07-29_16-25-55-313.txt  
Data_B_2015-07-29_16-25-54-200.txt
Data_A_2015-07-29_16-26-56-314.txt  
Data_B_2015-07-29_16-26-54-201.txt
Data_A_2015-07-29_16-27-54-201.txt

You can use a regex to locate the key info.

Since we are dealing with time stamps, the time should be rounded to the nearest time mark of interest.

Here is a function that will round up or down to the closest minute:

import datetime as dt

def round_time(dto, round_to=60):
    seconds = (dto - dto.min).seconds
    rounding = (seconds+round_to/2) // round_to * round_to
    return dto + dt.timedelta(0,rounding-seconds,-dto.microsecond)
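
Rounding and truncating put a timestamp into different buckets, which matters if the minute printed in the filename is what defines a pair. The comparison below is only a sketch: `floor_time` is a hypothetical helper that is not part of this answer, and `round_time` is restated with integer division so the snippet runs on its own.

```python
import datetime as dt


def round_time(dto, round_to=60):
    """Round to the nearest multiple of round_to seconds."""
    seconds = (dto - dto.min).seconds
    rounding = (seconds + round_to // 2) // round_to * round_to
    return dto + dt.timedelta(0, rounding - seconds, -dto.microsecond)


def floor_time(dto):
    """Hypothetical alternative: drop seconds and microseconds."""
    return dto.replace(second=0, microsecond=0)


stamp = dt.datetime.strptime('2015-07-29_16-25-55-313', '%Y-%m-%d_%H-%M-%S-%f')
print(round_time(stamp))  # 2015-07-29 16:26:00
print(floor_time(stamp))  # 2015-07-29 16:25:00
```

Which of the two is appropriate depends on how the buckets for a pair are defined.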

Combine that with looping through the files, and you can build a dictionary of lists whose key is the time stamp rounded to the minute.

(I suspect that your files are all in the same directory, so I am showing this with os.listdir instead of os.walk since os.walk recursively goes through multiple directories)

import os
import re
import datetime as dt

pat=re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

fnames={}
for fn in os.listdir(source):
    p=os.path.join(source, fn)
    if os.path.isfile(p):
        m=pat.search(fn)
        if m:   
            d=round_time(dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f'), round_to=60)
            fnames.setdefault(str(d), []).append((p, m.group(1)))

print(fnames)

Prints:

{'2015-07-29 16:28:00': [('/tmp/Data_A_2015-07-29_16-27-54-201.txt', 'A')], '2015-07-29 16:27:00': [('/tmp/Data_A_2015-07-29_16-26-56-314.txt', 'A'), ('/tmp/Data_B_2015-07-29_16-26-54-201.txt', 'B')], '2015-07-29 16:26:00': [('/tmp/Data_A_2015-07-29_16-25-55-313.txt', 'A'), ('/tmp/Data_B_2015-07-29_16-25-54-200.txt', 'B')]}

The five files have one file that does not have a pair. You can filter for all the file lists that are not of length two or do not have an A and B pair match.

First, define a test function that will test for that:

def test_list(l):
    return (len(l)==2 and {t[1] for t in l}==set('AB'))

Then use a list comprehension to find all the entries from the dict that do not meet your conditions:

>>> [(k, v) for k, v in fnames.items() if not test_list(v)]
[('2015-07-29 16:28:00', [('/tmp/Data_A_2015-07-29_16-27-54-201.txt', 'A')])]

Then act on those files:

for k, v in [(k, v) for k, v in fnames.items() if not test_list(v)]:
    for fn in v:
        print(fn[0])  # fn is a (path, letter) tuple; could be os.remove(fn[0])

The same basic method works with os.walk but you may have files in multiple directories.

Here is the complete listing:

source='/tmp'

import os
import re
import datetime as dt

pat=re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

def test_list(l):
    return (len(l)==2 and {t[1] for t in l}==set('AB'))   

def round_time(dto, round_to=60):
    seconds = (dto - dto.min).seconds
    rounding = (seconds+round_to/2) // round_to * round_to
    return dto + dt.timedelta(0,rounding-seconds,-dto.microsecond)             

fnames={}
for fn in os.listdir(source):
    p=os.path.join(source, fn)
    if os.path.isfile(p):
        m=pat.search(fn)
        if m:   
            d=round_time(dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f'), round_to=60)
            fnames.setdefault(str(d), []).append((p, m.group(1)))

for k, v in [(k, v) for k, v in fnames.items() if not test_list(v)]:
    for fn in v:
        print(fn[0])    # This is the file that does NOT have a pair -- delete?

Wow, thanks, that looks promising! However, the last two parts are not quite clear to me. Where do I have to put `[(k, v) for k, v in fnames.items() if not test_list(v)]`? — Peter S, Sep 07 '15 at 08:52
There is a problem with the rounding: `Data_A_2015-07-29_16-25-55-313.txt` would round up to 16:26, but it's a file for the timestamp 16-25, or am I seeing this wrong? — Peter S, Sep 07 '15 at 09:33
`{'2015-07-29_16-26': ['/tmp/Data_A_2015-07-29_16-25-55-313.txt','/tmp/Data_B_2015-07-29_16-25-54-200.txt'],` — Peter S, Sep 07 '15 at 09:45
I don't get why in your Prints, `2015-07-29:16_25`: Includes `'/tmp/Data_A_2015-07-29_16-25-55-313.txt'` --> 16-25-55-313 should get rounded up to 16:26:00 ? — Peter S, Sep 07 '15 at 09:53
I fixed the round-up problem by changing `rounding = (seconds + round_to/2) // round_to * round_to` to `rounding = (seconds - round_to/2) // round_to * round_to` (replaced the + with a -). Still, I can't get the last two parts to work: when using 5 files, just like you, I never end up with only one output (4 have A and B, 1 doesn't) — Peter S, Sep 07 '15 at 11:09
`I don't get why in your Prints...` Because I did not update what it printed. Fixed — dawg, Sep 07 '15 at 16:55
I think the time rounding is working correctly but it may not be what you need. I cannot see all your files, but usually in a case like this (imprecise floating value that needs to be put into buckets) you round. It depends on how you want to define the buckets for your pairs. — dawg, Sep 07 '15 at 17:06
I tried, but in the end `print (fn[0])` lists all items in the folder, those with A and B and those with just one A or B :-( The filenames are: A file: `IS_HR_NIR_outd_cal_HRC06991_2015-07-29_16-25-55-313.txt` and B file: `IS_HR_NIR_outd_cal_NQ51A05902_2015-08-12_03-35-52-247.txt`. I updated the regex so that should work (and it does). When I `print (fnames)` the output is `{'2015-07-29 16:28:00': [('/Path/IS_HR_NIR_outd_cal_HRC06991_2015-07-29_16-28-55-891.txt', 'HRC06991'), ('/Path/IS_HR_NIR_outd_cal_NQ51A05902_2015-07-29_16-28-55-891.txt', 'NQ51A05902')]` but `print (fn[0])` fails — Peter S, Sep 07 '15 at 20:34
OK, now it is working flawlessly: in `def test_list(l): return (len(l)==2 and {t[1] for t in l}==set('AB')) ` I replaced `==set('AB'))` with `!=set('AB')` and used my modified rounding version. Thanks a lot for your support — Peter S, Sep 08 '15 at 07:33
