
I'm trying to find a command that is flexible enough to match some variations of a string, but not others.

For instance, I am looking for audio files that have some variation of "rain" in the filename only (rains, raining, rained, rainbow, rainfall, a dark rain cloud, etc), whether at the beginning, end or middle of the filename.

However, this also includes words like "brain", "train", "grain", "drain", "Lorraine", et al, which are not wanted (basically any word that has nothing to do with the concept of rain).

Something like this fails:

find . -name '*rain*' ! -name '*brain*'| more

And I'm having no luck with even getting started on building a successful regex variant because I cannot wrap my mind around regex ... for instance, this doesn't do anything:

# this is incomplete, just a stub of where I was going
# -type f also still includes a directory name
find . -regextype findutils-default -iregex '\(*rain*\)' -type f  

Any help would be greatly appreciated. If I could see a regex command that does everything I want it to do, with an explanation of each character in the command, it would help me learn more about regex with the find command in general.


edit 1:

Taking cues from all the feedback so far from jhnc and Seth Falco, I have tried this:

find . -type f | grep -Pi '(?<![a-zA-Z])rain'

I think this pretty much works (I don't think it is missing anything), my only issue with it is that it also matches on occurrences of "rain" further up the path, not only in the file name. So I get example output like this:

./Linux/path/to/radiohead - 2007 - in rainbows/09 Jigsaw Falling Into Place.mp3

Since "rain" is not in the filename itself, this is a result I'd rather not see. So I tried this:

find . -type f -printf '%f\n' | grep -Pi '(?<![a-zA-Z])rain'

That does ensure that only filenames are matched, but it also does not output the paths to the filenames, which I would still like to see, so I know where the file is.

So I guess what I really need is a PCRE (PCRE2 ?) which can take the seemingly successful look-behind method, but only apply it after the last path delimiter (/ since I am on Linux), and I am still stumped.

derrgill
  • "Something like this fails." In what way does it fail? Folders? Prefix with `-type f`: `find . -type f -name '*rain*' ! -name ...` – jhnc Jun 03 '22 at 22:31
  • I mean it fails in that it still includes "brain", "brains", et al, when I was trying to exclude this ... the only thing I want to include before "rain" would be a space if it was not the beginning characters of the filename ... possibly a hyphen, but all of my filenames should have a hyphen (bookended by spaces) between artist and title – derrgill Jun 04 '22 at 03:09
  • `find -type f -regextype grep -iregex '.*\brain[^/]*'` – jhnc Jun 04 '22 at 03:32
  • `find -type f -regextype grep -iregex '.*[/ _]rain[^/]*'` – jhnc Jun 04 '22 at 03:35
  • The first suggestion with `\b` isn't ideal since `something_rain.mp3` wouldn't match. The 2nd suggestion with `[/ _]` would work, but the character class needs to be extended; it'd probably be easier to exclude alpha characters than to maintain a list of other characters. – Seth Falco Jun 04 '22 at 10:47
  • @jhnc thank you, these were both helpful. I saved the output of both to files, then diffed them in Meld to see what was caught by one and not by the other (apparently I have a few file names with a . or - preceding "rain" without spaces, which the first command variant caught, and as Seth pointed out I had some files with _ that were not caught). Would you mind walking me through the two commands? In the first one, for instance, why is only "brain" specified, not also/or, say, "train" or "drain"? How does each piece of syntax work in both commands? – derrgill Jun 04 '22 at 12:37
  • If you read the `grep` man-page it gives the complete syntax understood. In particular, `A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list. If the first character of the list is the caret ^ then it matches any character not in the list` and `\b matches the empty string at the edge of a word` (where I believe `word` has a technical definition which unfortunately includes `_` (viz. `[_[:alnum:]]+` - where `[:x:]` denotes a pre-defined "character class" x) ). – jhnc Jun 04 '22 at 14:58
  • @jhnc thank you for all your helpful comments, I've edited my original question to show my current progress, and what I'd still like to accomplish. – derrgill Jun 05 '22 at 16:43

2 Answers


specification:

  1. match "rain"
  2. in filename
  3. only at start of a word
  4. case-insensitive

assumptions:

  5. define "word" to be sequence of letters (no punctuation, digits, etc)
  6. paths have form prefix/name where prefix can have one or more levels delimited by / and name does not contain /

constraints:

  7. find -iregex matches against entire path (-name only matches filename)
  8. find -iregex must match entirety of path (eg. "c" is only a partial match and does not match path "a/b/c")
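The difference between -name and -regex described in the constraints can be checked on a throwaway tree. This is only a sketch: the scratch directory and the variable names are made up for illustration, and it assumes a find that supports -regex (GNU find does).

```shell
# Hypothetical scratch tree containing the single file ./a/b/c
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
touch "$demo/a/b/c"

# -name is tested against the last path component only.
by_name=$(cd "$demo" && find . -type f -name 'c')

# -regex is tested against the whole path (./a/b/c), so a bare 'c' is
# only a partial match and selects nothing.
by_partial_regex=$(cd "$demo" && find . -type f -regex 'c')

# A pattern that also consumes the prefix matches the whole path.
by_full_regex=$(cd "$demo" && find . -type f -regex '.*/c')

printf 'name:    %s\n' "$by_name"
printf 'partial: %s\n' "$by_partial_regex"
printf 'full:    %s\n' "$by_full_regex"

rm -rf "$demo"
```

The "partial" line should come out empty, while the other two should both show ./a/b/c.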

method:

find can return matches against non-files (eg. directories). Given assumption 6, the path alone does not tell us whether name is a directory or an ordinary file. To satisfy 2, we can exclude non-files using find's -type f predicate.

We can compare paths found by find against our specification by using find's case-insensitive regex matching predicate (-iregex). The "grep" flavour (-regextype grep) is sufficiently expressive.

Just using 1, a suitable regex is: rain

2+6+7 says we must forbid / after "rain": rain[^/]*$

  • [/] matches character in set (ie. /)
  • [^/]: ^ inverts match: ie. character that is not /
  • * matches preceding match zero or more times
  • $ constrains preceding match to occur at end of input

3+5 says there must be no immediately preceding word characters: [^a-z]rain[^/]*$

  • a-z is a shortcut for the range a to z

8 requires matching the prefix explicitly: ^.*[^a-z]rain[^/]*$

  • ^ outside of [...] constrains subsequent match to occur at beginning of input
  • . matches anything
  • [^a-z] matches a non-alphabetic
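One way to watch the regex tighten at each step is to run the intermediate patterns through grep itself, since the "grep" flavour is the one being used. The sample paths and the count helper below are invented for illustration; grep -i stands in for find's -iregex, and LC_ALL=C is set only to keep the a-z range predictable. Note that grep matches substrings, so the explicit ^ and $ matter here in a way they do not for find.

```shell
# Made-up sample paths, one per line.
paths='./rain.mp3
./a - rainfall.mp3
./brain.mp3
./in rainbows/09 Jigsaw.mp3
./misc/Lorraine.mp3
./misc/purple_rain.mp3'

# Hypothetical helper: count how many sample paths a pattern selects.
count() { printf '%s\n' "$paths" | LC_ALL=C grep -ic -- "$1"; }

step1=$(count 'rain')                 # "rain" anywhere in the path
step2=$(count 'rain[^/]*$')           # no / after "rain": filename only
step3=$(count '[^a-z]rain[^/]*$')     # no letter directly before "rain"
step4=$(count '^.*[^a-z]rain[^/]*$')  # prefix spelled out, as find needs

printf '%s %s %s %s\n' "$step1" "$step2" "$step3" "$step4"
```

With these six paths the counts should shrink from 6 to 5 (the "in rainbows" directory drops out) and then to 3 (brain and Lorraine drop out); the last step selects the same 3 paths and only changes how much of the line the pattern covers.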

Final command-line:

find . -type f -regextype grep -iregex '^.*[^a-z]rain[^/]*$'

Note: The leading ^ and trailing $ are not actually required, given 8, and could be elided.
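To sanity-check the final command, it can be run against a small scratch tree. The file and directory names below are hypothetical, and the sketch assumes GNU find (for -regextype); the output is sorted with LC_ALL=C only to make the order stable.

```shell
# Hypothetical scratch tree with wanted and unwanted names.
demo=$(mktemp -d)
mkdir -p "$demo/in rainbows" "$demo/rain sounds"
touch "$demo/Rain.mp3" \
      "$demo/artist - rainfall.mp3" \
      "$demo/brain.mp3" \
      "$demo/in rainbows/09 Jigsaw.mp3" \
      "$demo/rain sounds/train.mp3" \
      "$demo/rain sounds/purple_rain.mp3"

# The final command from above, run from inside the scratch tree.
matches=$(cd "$demo" &&
  find . -type f -regextype grep -iregex '^.*[^a-z]rain[^/]*$' |
  LC_ALL=C sort)

printf '%s\n' "$matches"
rm -rf "$demo"
```

Only Rain.mp3, "artist - rainfall.mp3" and "rain sounds/purple_rain.mp3" should be listed: brain.mp3 and train.mp3 fail the "no preceding letter" test, and "rain" appearing only in a directory name does not count.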


exercise for the reader:

  1. extend "word" to non-ASCII characters (eg. UTF-8)
jhnc

You probably want to use either a character class, word boundary, or just have a negative look behind for alpha characters.

Look Behind

^.+(?<![a-zA-Z])rain[^\/]*$

Matches any instance of rain, but only if it's not following [a-zA-Z], and doesn't have any slashes afterwards. Unfortunately, find doesn't support look ahead or look behind… so we'll use a character class instead.
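If the lookbehind form is still preferred, one possible workaround is to let find print full paths and filter them with grep -P. This is a sketch under assumptions: it needs a grep built with PCRE support, the scratch file names are made up, and paths containing newlines would break the pipe. The trailing [^/]*$ is what restricts the match to the filename while keeping the full path in the output.

```shell
# Hypothetical scratch tree: "rain" appears in a directory name,
# in a filename, and inside the word "brainstorm".
demo=$(mktemp -d)
mkdir -p "$demo/in rainbows"
touch "$demo/in rainbows/09 Jigsaw.mp3" \
      "$demo/in rainbows/10 Rain Song.mp3" \
      "$demo/brainstorm.mp3"

# (?<![a-zA-Z])  "rain" must not be preceded by a letter
# [^/]*$         no / may follow, so the match sits in the last component
lookbehind_matches=$(cd "$demo" &&
  find . -type f | grep -Pi '(?<![a-zA-Z])rain[^/]*$')

printf '%s\n' "$lookbehind_matches"
rm -rf "$demo"
```

Here only "./in rainbows/10 Rain Song.mp3" should be printed, with its full path.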

Character Class

^.+(?:^|[^a-zA-Z])rain[^\/]*$

Matches the start of the line, or a character that isn't in [a-zA-Z], then matches rain if it comes immediately after, so long as no slashes follow it.

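You can use it in find like this. find's default regex syntax (Emacs-style) has no (?:...) groups, and every path printed by find ./ starts with ./, so the start-of-line alternative can be dropped:

find ./ -iregex '.*[^a-zA-Z]rain[^/]*'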

The ^ at the start and $ at the end of the pattern are implied when using find with -iregex, so you can omit them.

Seth Falco
  • Perhaps worth noting this uses [findutils default regex type](https://www.gnu.org/software/findutils/manual/html_node/find_html/findutils_002ddefault-regular-expression-syntax.html). Also, that patterns need to completely match the entire path (`find -name` matches basename not full path, and `grep` etc, normally match if just a substring matches) – jhnc Jun 04 '22 at 15:07
  • I couldn't get `find ./ -iregex '.*(?:^|[^a-zA-Z])rain.*'` to work, but I did try the look-behind method by piping find to grep (I edited my original question to show more detail on this). – derrgill Jun 05 '22 at 16:42
  • @derrgill Didn't notice your comment until now. Edited my answer, basically just adds `[^\/]*` to the end of it, which ensures after matching `rain` that there are no directory separators after it. – Seth Falco Jul 17 '22 at 16:25