0

I am trying to loop over some compressed files (extension '.gz') and I am running into a problem. I want to perform a specific action when the FIRST file ending in 'aa' is encountered; it can be any of them, not necessarily the first one in the list. Only then should Python check whether there are OTHER "aa" files in the folder; if so, the 2nd rule has to be applied to them. (There may be one or many "aa" files.) Finally, the 3rd rule has to be applied to all other files not ending in "aa".

However, when I run the code below, not all the files get processed.

What am I doing wrong?

Thanks!

import os

import pandas as pd

inputPath = "write your path"
fileExt = r".gz"
flag = False  # becomes True once the first "aa" file has been processed

for item in os.listdir(inputPath):  # loop through items in dir
    if item.endswith(fileExt):  # check for ".gz" extension
        full_path = os.path.join(inputPath, item)  # get full path of files

        if item.endswith('aa' + fileExt) and flag == False:
            df = pd.read_csv(full_path, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")  # from gzip to pandas df
            # do something
            flag = True
            print('1 rule:', "The item processed is ", item)

        elif item.endswith('aa' + fileExt) and flag == True:
            df = pd.read_csv(full_path, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")  # from gzip to pandas df
            # do something else
            print('2 rule:', "The item processed is ", item)

        elif not (item.endswith('aa' + fileExt)) and flag == True:
            df = pd.read_csv(full_path, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")  # from gzip to pandas df
            # do something else
            print('3 rule:', "The item processed is ", item)

I believe this is because Python iterates over the list of files in alphabetical order, so the other files are ignored. How can I fix this issue?

LIST OF FILES:

File_202112311aa.gz
File_20211231ab.gz
File_20211231.gz
File_20211231aa.gz

OUTPUT
1 rule The item processed is  File_202112311aa.gz
3 rule The item processed is  File_20211231ab.gz
2 rule The item processed is  File_20211231aa.gz
DaniB
  • @9769953 Thanks for your feedback, I will edit the code following your advice. Btw, this is the actual output. Only 3 out of 4 files are shown. – DaniB Jan 18 '22 at 19:37
  • two of the files have the same name. – Ollie in PGH Jan 18 '22 at 19:39
  • @OllieinPGH edited. – DaniB Jan 18 '22 at 19:40
  • I recommend using `glob.glob()` instead of `os.listdir()`. It will do the suffix check itself, and it returns full paths, so you don't need `item.endswith` or `os.path.join`. – Barmar Jan 18 '22 at 19:54
  • Take the search for the `aa` file out of the loop. Do that first, then loop over the other files. – Barmar Jan 18 '22 at 19:56
  • @Barmar if I used two for loops, the file ending with "aa" would be processed twice, no? – DaniB Jan 18 '22 at 20:13
  • Not if you skip them in the second loop. – Barmar Jan 18 '22 at 20:14
  • @Barmar I wanna perform a separate action when the first "aa" file is found, then another one when it is found a second/third/etc time. – DaniB Jan 18 '22 at 20:17
  • I don't think you want to wait *until* the first file ending in "aa" is encountered. Instead, it seems you first want to process the "aa.gz" file, then process all other files. Since you complain that the file "File_20211231.gz" is missing in the output, which indicates you *also* want to process that file, even if it comes *before* the "aa.gz" file. Is that correct? – 9769953 Jan 18 '22 at 20:30
  • Exactly, I want to find the first ''aa" file, perform a specific action and ONLY THEN process all the others, regardless of how the files are sorted in the Input folder. The file "aa" might also be found at the end of the list. @9769953 – DaniB Jan 18 '22 at 20:33
  • Then what's the problem with processing the `aa` file twice? – Barmar Jan 18 '22 at 20:47
  • @9769953 I'm getting really confused about what the OP wants. I suggested that he skip it in the second loop, then he said he wants to do additional operations on the file when he finds it the second time with all the others. – Barmar Jan 18 '22 at 21:18
  • @Barmar pls look at my comment under 9769953's solution. There I explain why the "aa" file cannot be processed twice. Let me know if you have any further questions – DaniB Jan 19 '22 at 11:43
  • Why are you checking `flag` for the non-`*aa.gz` files? That entire section should just be the `else`. – John Go-Soco Jan 19 '22 at 12:21
  • @JohnGo-Soco In that case, a non-aa.gz file might be processed before the (any) first aa.gz file. Depending on the ordering that os.listdir() returns. – 9769953 Jan 19 '22 at 12:25
  • @DaniB Earlier you said "I wanna perform a separate action when the first "aa" file is found, then another one when it is found a second/third/etc time." So that sounds like you want to process it twice. – Barmar Jan 19 '22 at 15:19

2 Answers

2

Largely untested, but something along the following lines should work.

This code first processes a file ending in 'aa.gz' (note: not all files ending in 'aa.gz' are processed first, as this is not stated in the question), then processes the remaining files. There is no particular ordering for the remaining files: this will depend on how Python has been built on the system, and what the (file)system does by default, and is simply not guaranteed.

import glob

import pandas as pd

# Obtain an unordered list of compressed files
filenames = glob.glob("*.gz")

# Now find a filename ending with 'aa.gz'
for i, filename in enumerate(filenames):
    if filename.endswith('aa.gz'):
        firstfile = filenames.pop(i)
        # We immediately break out of the loop, 
        # so we're safe to have altered `filenames`
        break
else:  
    # the sometimes useful and sometimes confusing else part 
    # of a for-loop: what happens if `break` was not called:
    raise ValueError("no file ending in 'aa.gz' found!")

# Ignoring the `full_path` part
df = pd.read_csv(firstfile, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")
# do something
print(f"1 rule: The file processed is {firstfile}")
          
# Process the remaining files
for filename in filenames:
    df = pd.read_csv(filename, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")
    if filename.endswith('aa.gz'):
        # do something
        print(f"2 rule: The file processed is {filename}")
    else:
        # do something else
        print(f"3 rule: The file processed is {filename}")
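
The selection logic above can also be checked without any files on disk. The following is only a minimal sketch, assuming the three rules as described in the question; the helper name `split_by_rule` is hypothetical and not part of the answer's code, and the file names are taken from the question's list.

```python
def split_by_rule(filenames, suffix="aa.gz"):
    """Split file names into (first_aa, other_aa, rest).

    Hypothetical helper: first_aa gets rule 1, other_aa gets rule 2,
    rest gets rule 3. Works on names only and reads no files.
    """
    aa_files = [name for name in filenames if name.endswith(suffix)]
    rest = [name for name in filenames if not name.endswith(suffix)]
    if not aa_files:
        raise ValueError(f"no file ending in {suffix!r} found!")
    return aa_files[0], aa_files[1:], rest


# File names from the question, in an arbitrary order
names = [
    "File_20211231.gz",
    "File_20211231ab.gz",
    "File_202112311aa.gz",
    "File_20211231aa.gz",
]
first, other_aa, rest = split_by_rule(names)
print("1 rule:", first)
print("2 rule:", other_aa)
print("3 rule:", rest)
```

Because the "aa" files are picked out before anything is processed, the result does not depend on where the first "aa" file appears in the listing.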
9769953
  • You could simplify it by using `glob.glob("*aa.gz")` for the first loop, then `glob.glob("*.gz")` for the second loop. – Barmar Jan 18 '22 at 20:49
  • @Barmar Depending on how you intend to do that, that would become problematic if there are files `123aa.gz` and `456aa.gz`. You still have to filter out a single file ending with "aa.gz", and add the other file(s) to the next list. Also, your second glob pattern will read *all* files, including all the *aa.gz files, which means the first file will be processed again. – 9769953 Jan 18 '22 at 21:11
  • I think he wants the first file to be processed again. – Barmar Jan 18 '22 at 21:16
  • @9769953 Thanks for the solution and your time. There is only one problem. There should be 3 rules in total, the 1st one for the first "aa" file found (which works as expected with your code), the second rule, which should be applicable to other files ending with "aa" if any (DIFFERENT from the first "aa" file - therefore the first "aa" file shouldn't be processed twice). The 3rd rule - applicable to all other files, that do not end in "aa". To summarize, there may be 1 or several "aa" files, and the solution regardless of how they are sorted in the folder must enforce those rules. – DaniB Jan 19 '22 at 11:42
  • I edited the question. It should be more clear now. – DaniB Jan 19 '22 at 11:51
  • @DaniB Should all "rule 2" files (thus, remaining \*aa.gz files) be processed *before* the "rule 3" files (i.e., all other files)? – 9769953 Jan 19 '22 at 11:53
  • @9769953 Not necessarily. It can also be processed subsequent to Rule 3 – DaniB Jan 19 '22 at 11:55
  • @DaniB But the rule 2 and rule 3 files can be interspersed and processed in any order otherwise? It's just the process itself that is different for these file types? – 9769953 Jan 19 '22 at 11:58
  • Exactly, the order does not matter, because the other "aa" files will be processed differently than the others. – DaniB Jan 19 '22 at 12:04
  • @DaniB That is an easy if-else. See the edited answer. – 9769953 Jan 19 '22 at 12:05
0

Others here have provided much more optimised solutions for you, but this is to answer your original question of why not all files are being processed.

In your code, you've got three conditions to process a file:

  • It is a *aa.gz file, and it is the first one found.
  • It is a *aa.gz file, and it is the second or a later *aa.gz file found.
  • It is not a *aa.gz file, and a previous *aa.gz file has already been found.

So it will skip any non-*aa.gz file that is listed before the first *aa.gz file.
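
To see the skip happen, the three conditions can be replayed on file names alone. This is a rough sketch rather than the original code: `simulate_original_loop` is a hypothetical name, no files are read, and the listing order is an assumption chosen so that the non-aa file comes first (`os.listdir()` returns entries in arbitrary order).

```python
def simulate_original_loop(names, ext=".gz"):
    """Replay the question's if/elif chain on names only (hypothetical helper)."""
    flag = False  # True once the first "aa" file has been seen
    processed, skipped = [], []
    for item in names:
        if not item.endswith(ext):
            continue
        is_aa = item.endswith("aa" + ext)
        if is_aa and not flag:
            flag = True
            processed.append((1, item))
        elif is_aa and flag:
            processed.append((2, item))
        elif not is_aa and flag:
            processed.append((3, item))
        else:
            # not an "aa" file and no "aa" file seen yet: no branch matches
            skipped.append(item)
    return processed, skipped


# Assumed listing order; the real order from os.listdir() is not guaranteed
assumed_order = [
    "File_20211231.gz",
    "File_202112311aa.gz",
    "File_20211231ab.gz",
    "File_20211231aa.gz",
]
processed, skipped = simulate_original_loop(assumed_order)
print(processed)
print(skipped)
```

With this assumed order, `File_20211231.gz` ends up in `skipped`, which is consistent with the three lines of output shown in the question.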

John Go-Soco