My team has a program which generates lots of temporary files when it runs and deletes them once it's finished. Unfortunately, if the program is interrupted, these files get left behind in arbitrary places within the program's directory tree (usually alongside the individual scripts which created them).
To make cleanup simpler in these cases, we'd like to refactor the code to place all of the temporary files within a single designated directory.
The first step seems to be to get a list of all the temporary files which we're generating. I've managed to accomplish this as follows:
- Open a BASH shell
- `cd` to the program's directory
- Run `inotifywait -m --timefmt "%F %T" --format "%T %w %f %e" -r . >> modified_files.log`
- Open another BASH shell
- Run the program in the new shell
- Wait several hours for the program to finish running
- Terminate the `inotifywait` process in the first shell

`modified_files.log` will now contain millions of lines (hundreds of megabytes) of output like this:

```
2019-07-23 12:28:33 ./project/some_dir/ some_file OPEN
2019-07-23 12:28:33 ./project/some_dir/ some_file MODIFY
2019-07-23 12:28:33 ./project/some_dir/ some_file CLOSE_WRITE,CLOSE
2019-07-23 12:28:33 ./project/some_other_dir/ some_other_file OPEN
2019-07-23 12:28:33 ./project/some_other_dir/ some_other_file MODIFY
2019-07-23 12:28:33 ./project/some_other_dir/ some_other_file CLOSE_WRITE,CLOSE
```
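For anyone parsing this, the fields of each line map onto the `--format "%T %w %f %e"` string like so:

```
%T  ->  2019-07-23 12:28:33    (date and time, i.e. two whitespace-separated fields)
%w  ->  ./project/some_dir/    (directory containing the file, with a trailing slash)
%f  ->  some_file              (name of the file the event applies to)
%e  ->  OPEN                   (comma-separated list of event names)
```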
Pass `modified_files.log` to the following script:

```bash
#!/bin/bash -e

# We'll store paths to the modified files here without any duplicates
declare -A UNIQUE_FILES

# Iterate over every line of output in modified_files.log
while IFS= read -r line; do
    # In the first line from the output example this would find ./project/some_dir/
    directory="$(grep -Po ".*?\s.*?\s\K.*?(?=\s.*)" <<< "$line")"
    # In the first line from the output example this would find some_file
    file="$(grep -Po ".*?\s.*?\s.*?\s\K.*?(?=\s.*)" <<< "$line")"

    path="${directory}${file}"

    # Only record the path from this output line if we haven't already recorded it
    if [[ -n "$path" ]] && [[ -z "${UNIQUE_FILES["$path"]}" ]]; then
        UNIQUE_FILES["$path"]=1
    fi
done < "$1"

# Save all of the recorded paths as separate lines within a single 'list' variable
for unique_file in "${!UNIQUE_FILES[@]}"; do
    list="${list}"$'\n'"${unique_file}"
done

# Sort the 'list' variable to make the list of paths visually easier to read
list="$(echo "$list" | sort)"

# Print the paths of all the modified files
echo "$list"
```
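For completeness, I invoke it like this (the script name is just whatever I happened to save it as):

```bash
# Hypothetical file name for the script above; redirect to keep the list around
./list_modified_files.sh modified_files.log > unique_modified_files.txt
```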
This works, but it takes about a minute to parse each megabyte of output produced by inotifywait. I feel like there ought to be a much faster way to do this next time the need arises. I'm hoping for solutions which address any of the following:
- Inefficiencies in the grep commands shown above (e.g. perhaps using calls to sed/awk instead? See the first sketch after this list.)
- Inefficiencies with the parsing script in general
- Inefficiencies with the inotifywait command which I'm using (e.g. removing the timestamps or calling it with some special flags to reduce the verbosity? See the second sketch after this list.)
- Inefficiencies with the general process listed above
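To make the kind of thing I'm imagining a bit more concrete, here are two rough sketches. I haven't benchmarked either of them, and they make the same assumption my script does, namely that no paths contain whitespace. The first replaces the per-line grep calls with a single awk/sort pipeline over the existing log format:

```bash
# Fields per line: date, time, directory (with trailing slash), file name, event names.
# Joining fields 3 and 4 gives the full path; sort -u then removes duplicates.
awk '{ print $3 $4 }' modified_files.log | sort -u
```

The second reduces the verbosity of inotifywait itself by dropping the timestamps, printing each path as a single field, and only reporting write events:

```bash
# %w%f prints the watched directory plus file name as one path per line;
# -e close_write restricts the output to files that were written to and closed.
inotifywait -m -r -e close_write --format "%w%f" . >> modified_files.log
sort -u modified_files.log
```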