
My team has a program which generates lots of temporary files when it runs and deletes them once it's finished. Unfortunately, if the program is interrupted, these files get left behind in arbitrary places within the program's directory tree (usually alongside the individual scripts which created them).

In order to make cleanup simpler for these cases we'd like to refactor the code to place all of the temporary files within a single designated directory.

The first step seems to be to get a list of all the temporary files which we're generating. I've managed to accomplish this as follows:

  1. Open a BASH shell
  2. cd to the program's directory
  3. Run inotifywait -m --timefmt "%F %T" --format "%T %w %f %e" -r . >> modified_files.log
  4. Open another BASH shell
  5. Run the program in the new shell
  6. Wait several hours for the program to finish running
  7. Terminate the inotifywait process in the first shell. modified_files.log will now contain millions of lines (hundreds of megabytes) of output like this:

    2019-07-23 12:28:33 ./project/some_dir/ some_file OPEN
    2019-07-23 12:28:33 ./project/some_dir/ some_file MODIFY
    2019-07-23 12:28:33 ./project/some_dir/ some_file CLOSE_WRITE,CLOSE
    2019-07-23 12:28:33 ./project/some_other_dir/ some_other_file OPEN
    2019-07-23 12:28:33 ./project/some_other_dir/ some_other_file MODIFY
    2019-07-23 12:28:33 ./project/some_other_dir/ some_other_file CLOSE_WRITE,CLOSE
    
  8. Pass modified_files.log to the following script:

    #!/bin/bash -e
    
    # We'll store paths to the modified files here without any duplicates
    declare -A UNIQUE_FILES
    
    # Iterate over every line of output in modified_files.log
    while IFS= read -r line; do
    
        # In the first line from the output example this would find ./project/some_dir/
        directory="$(grep -Po ".*?\s.*?\s\K.*?(?=\s.*)" <<< "$line")"
    
        # In the first line from the output example this would find some_file
        file="$(grep -Po ".*?\s.*?\s.*?\s\K.*?(?=\s.*)" <<< "$line")"
    
        path="${directory}${file}"
    
        # Only record the path from this output line if we haven't already recorded it
        if [[ -n "$path" ]] && [[ -z "${UNIQUE_FILES["$path"]}" ]]; then
            UNIQUE_FILES["$path"]=1
        fi
    done < "$1"
    
    # Save all of the recorded paths as separate lines within a single 'list' variable
    # (only add the newline separator once 'list' is non-empty, so the output
    # doesn't start with a blank line)
    nl=$'\n'
    for unique_file in "${!UNIQUE_FILES[@]}"; do
        list="${list:+${list}${nl}}${unique_file}"
    done
    
    # Sort the 'list' variable to make the list of paths visually easier to read
    list="$(echo "$list" | sort)"
    
    # Print the paths of all the modified files
    echo "$list"
    

This works, but it takes about a minute to parse each megabyte of output produced by inotifywait. I feel like there ought to be a much faster way to do this the next time the need arises. I'm hoping for solutions which address any of the following:

  • Inefficiencies in the grep commands shown above (e.g. perhaps using calls to sed/awk instead?)
  • Inefficiencies with the parsing script in general
  • Inefficiencies with the inotifywait command which I'm using (e.g. remove the timestamps or call it with some special flags to reduce the verbosity; a sketch of this follows the list)
  • Inefficiencies with the general process listed above
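
For illustration, here is a minimal, untested sketch of the "reduce inotifywait's verbosity" idea. It assumes the temporary files of interest are actually created or written (not just opened) and that no filename contains a newline; with this output format the whole parsing step collapses to a single `sort -u`:

    # Sketch only: log just the path of each file that is created, moved in,
    # or written: no timestamps, no event names, one path per line.
    inotifywait -m -r -q \
        -e create -e moved_to -e close_write \
        --format '%w%f' . >> modified_files.log

    # Afterwards, deduplicate in a single pass instead of a bash loop:
    sort -u modified_files.log
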
Alex Jansen
  • I recommend looking into a cleanup trap on `EXIT` instead, like this: http://redsymbol.net/articles/bash-exit-traps/ – Benjamin W. Jul 23 '19 at 21:22
  • @BenjaminW. We do have one, but it only cleans up a particular directory. We're not even sure where all the other temporary files are going so it's difficult to clean them up right now. AFAIK a SIGKILL (or a power outage for that matter) will also bypass an exit trap, but that's another problem. – Alex Jansen Jul 23 '19 at 21:30
  • `touch ./canary; run-the-program`. Then `find ./ -anewer canary -delete` – Léa Gris Jul 23 '19 at 21:46
  • You're relying on your filenames not containing blanks; this should have the same output as your script: `awk '{print $3 $4}' modified_files.log | sort -u` – Benjamin W. Jul 23 '19 at 22:32
  • @LéaGris That seems like a reasonable solution for cleaning up the files which were written. It would almost work for detecting the temporary files with a `-print` argument on the end (IE: run it once and see which files it deletes), but unfortunately it can't detect files which have already been deleted. – Alex Jansen Jul 24 '19 at 00:22
  • `strace -o >(sed -n 's/^[^"]*"\(.*\)".*$/\1/p' | sort -u >usedfiles) -f -e openat ... -- prog ...` – jhnc Jul 24 '19 at 00:37
  • @AlexJohnson Obviously find will not detect already deleted files. Then why are you bothering to have them listed for your cleanup? – Léa Gris Jul 24 '19 at 01:08
  • rewrite the whole `while` loop as an `awk` script. Only one process being started, not N(um of lines)*4 (maybe more). Good luck. – shellter Jul 24 '19 at 01:17
  • @jhnc Glancing through the man pages this looks like it covers all of my steps at once. Feel free to elaborate on it and post it as an answer! – Alex Jansen Jul 24 '19 at 01:20
  • @LéaGris I'd rather not have to search for the files every time. If we know what temporary files are being written then we can hunt down the bits of code which create them and change them to place the files in a single shared location. Then all we have to do is delete everything in that location on program start instead of searching around for stray files. – Alex Jansen Jul 24 '19 at 01:24
  • @AlexJohnson : For each single line of the input file, you create two child processes! No surprise that it takes quite long, if the input file is big. – user1934428 Jul 24 '19 at 07:06
  • @user1934428 yes, for some reason inotifywait separates the directory and file names with a space. I couldn't find a simple way to capture two groups at the same time with grep so I called it twice and appended them to format out that space. In hindsight it might have been more efficient to capture them both in a single group and then use BASH parameter expansion to strip the whitespace out without another subprocess call. Perhaps sed also has a way to do the search+strip in a single subprocess, but I'm not familiar with it. – Alex Jansen Jul 24 '19 at 09:03
  • @AlexJohnson : You could apply the bash regexp operator to pick out the individual parts of the line you are interested in. In any case, avoid creating even a single child process within a loop over a huge file. – user1934428 Jul 24 '19 at 11:32
  • @jhnc - I had to tweak the sed expression a bit (`s/^[0-9]*\s*openat(.*,\s"\(.*\)",\s.*WR.*$/\1/p`), but your solution worked perfectly for some jobs which ran last night. I'll describe+post it as an answer in a couple of days unless you'd still like to. – Alex Jansen Jul 24 '19 at 23:16
  • @AlexJohnson, ...btw, as a note -- you'll get orders-of-magnitude performance enhancements just from using native bash string manipulation in your inner loop, and **not** (ever!) running command substitutions, pipes, or other operations that involve forking per line of input read. (It's fine to run one `grep` that processes 10,000 lines; it's not at all fine to run 10,000 separate copies of `grep`, one per line). Fortunately, bash has very robust string manipulation primitives, so you don't *need* any of the `grep` bits in this code. – Charles Duffy Jul 25 '19 at 01:13
  • Rewrite your regexes from PCRE to POSIX ERE, and `[[ $string =~ $re ]]` will give you a very fast in-process match, putting the resulting match groups in the `BASH_REMATCH` array. – Charles Duffy Jul 25 '19 at 01:14
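
A minimal sketch of the in-process approach described in the last two comments (untested; it assumes the five-field line format shown in step 7 and that directory and file names contain no whitespace):

    #!/bin/bash
    # Sketch: parse each log line with bash's built-in =~ operator instead of
    # spawning two grep processes per line.
    declare -A UNIQUE_FILES

    # Fields per line: date time directory file events
    re='^[^ ]+ [^ ]+ ([^ ]+) ([^ ]+) [^ ]+$'

    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            # BASH_REMATCH[1] is the directory, BASH_REMATCH[2] is the file name
            UNIQUE_FILES["${BASH_REMATCH[1]}${BASH_REMATCH[2]}"]=1
        fi
    done < "$1"

    # Print each unique path once, sorted for readability
    printf '%s\n' "${!UNIQUE_FILES[@]}" | sort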

1 Answer

strace may work, although it can cause performance issues.

You would look for files that have been opened for writing, or perhaps you could just check for files that are deleted/unlinked (cf. System calls in Linux that can be used to delete files).

Filenames in strace output may be given relative to the current directory, so you may want to log chdir() too.

The basic invocation would be:

strace -f -o LOGFILE -e WhatToTrace -- PROGRAM ARGUMENTS

Examples of syscalls to include in WhatToTrace are:

  • openat,open,creat - trace file access/creation
  • mkdirat,mkdir - trace directory creation
  • unlinkat,unlink,rmdir - find deleted files and directories
  • chdir - log when current working directory changes
  • renameat,rename - find overwritten files

Once you have your LOGFILE, you can write a simple script to process the paths that have been recorded.
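
For example, a rough sketch (it assumes strace's default output format, where the path argument is the first double-quoted string on each logged line, and that paths contain no embedded double quotes or newlines):

    # Sketch: trace the relevant syscalls while running the program...
    strace -f -o LOGFILE \
        -e trace=openat,open,creat,mkdirat,mkdir,unlinkat,unlink,rmdir,chdir,renameat,rename \
        -- PROGRAM ARGUMENTS

    # ...then pull the first quoted string (the path argument) out of each
    # logged call and print each unique path once.
    sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' LOGFILE | sort -u

If only files opened for writing are of interest, the sed pattern can be further restricted to lines containing write flags, as the asker did in the comments above.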

jhnc
  • If you want something much, *much* faster than strace, look at sysdig and other clients to the eBPF subsystem. Whereas strace traces a single process with maybe 500% overhead (if you're lucky!), sysdig traces *your entire operating system at once* with on the scale of 3% overhead (dumping events into a ring buffer that's shuffled into userspace for later filtering/analysis). – Charles Duffy Jul 25 '19 at 00:51
  • (sysdig isn't *just* an eBPF client, but the eBPF version of their probe is more reliable with recent kernels -- though they reportedly just found a fix for that today, it's not yet merged, much less in a release). – Charles Duffy Jul 25 '19 at 00:53
  • @CharlesDuffy Can sysdig be used as an ordinary user without root to load kernel modules or other special permissions? – jhnc Jul 25 '19 at 01:00
  • Sysadmin cooperation is needed. (Since it isn't restricted to tracing just one user's processes, that'd be a major security vulnerability otherwise; anyone could snoop on *everyone else's* processes, data entry, etc). That said, there's a filter language available plus lua support for more advanced filters, so a sysadmin could easily stream a trace with just one user's events to somewhere that user could read it... – Charles Duffy Jul 25 '19 at 01:08
  • For the record: running my program with strace took 27% longer than it did without strace. sysdig looks like a good tool to turn to in situations where there's more of an impact though. – Alex Jansen Jul 25 '19 at 18:36