1

I saw something weird today in the behaviour of the Bash Shell when globbing.

So I ran an ls command with the following Glob:

ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]* | grep ":"

the result was as expected

GM12878_Hs_InSitu_MboI_rE1_TagDirectory:
GM12878_Hs_InSitu_MboI_rE2_TagDirectory:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rG1_TagDirectory:
GM12878_Hs_InSitu_MboI_rG2_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:

However, when I change the same glob by adding an underscore before the asterisk, like this:

ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_* | grep ":"

my expected result is the complete set as shown above, however what I get is a subset:

GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:

Can someone explain what's wrong in my logic when I introduce an underscore sign before the asterisk?

I am using Bash.

Charles Duffy
FoldedChromatin

3 Answers

4

You misunderstand what your glob is doing.

You were expecting this:

GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]*

to match files containing any of those comma-separated segments, but that's not what [] globbing does. [] is a character class: it matches exactly one character from the set listed between the brackets.

Compare:

$ echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]

to what you were trying to get (which is brace {} expansion):

$ echo GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}
GM12878_Hs_InSitu_MboI_rE1 GM12878_Hs_InSitu_MboI_rE2 GM12878_Hs_InSitu_MboI_rF GM12878_Hs_InSitu_MboI_rG1 GM12878_Hs_InSitu_MboI_rG2 GM12878_Hs_InSitu_MboI_rH

You wanted that latter expansion.

Your pattern uses a character class that matches a single character: one of E, F, G, H, 1, 2, or a comma. It's identical to:

GM12878_Hs_InSitu_MboI_r[EFGH12,]_*

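which, as I expect you can now see, isn't going to match any of the two-character entries such as E1 or G2, because the single matched character must be followed immediately by the underscore. The underscore-less version matches them only because the trailing * absorbs the second character.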
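
To see the three behaviours side by side, here is a minimal sketch that recreates the six directory names from the question in a scratch directory. The variable names and the use of mktemp are my own choices for illustration, and it assumes Bash with default options (no nullglob or failglob).

```shell
#!/usr/bin/env bash
# Scratch directory holding the six directory names from the question.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_TagDirectory

# Character class then "_": exactly one character, then an underscore.
class_matches=$(echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_*)

# Character class then "*": one character, then any string at all.
class_star_matches=$(echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]*)

# Brace expansion: one word per alternative, each globbed on its own.
brace_matches=$(echo GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_*)

printf '%s\n' "class + _ : $class_matches"
printf '%s\n' "class + * : $class_star_matches"
printf '%s\n' "braces + _: $brace_matches"
```

Only the first pattern should come back with just the rF and rH directories; the other two should list all six.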

Etan Reisner
0

* in filesystem globs is not like * in a regex. In a regex, * means "zero or more of the preceding pattern", but in a glob it means "any string at all, of any length, including empty". So in your first example, the _ is just part of the "anything" matched by the *, but in the second you're matching any single character from your character class (not the multi-character patterns you seem to be trying to define), followed by _, followed by anything at all.

Also, character classes don't work the way you're trying to use them. [...] matches any one character listed within the brackets, so your pattern is actually the same as [EFGH12,], since those are all the distinct characters in the class you defined.

To get the grouping of patterns you want, use braces instead of square brackets, like this:

ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_* | grep ":"
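
The difference between a glob * and a regex * can be checked without touching the filesystem, using Bash's [[ ]] with == (glob match) and =~ (regex match). This is only a sketch: the helper functions glob_test and regex_test are hypothetical names introduced for the example.

```shell
#!/usr/bin/env bash
# Two of the directory names from the question.
two_char=GM12878_Hs_InSitu_MboI_rG1_TagDirectory
one_char=GM12878_Hs_InSitu_MboI_rF_TagDirectory

# Hypothetical helpers: report whether $1 matches the glob / regex in $2.
glob_test()  { if [[ $1 == $2 ]]; then echo match; else echo no-match; fi; }
regex_test() { if [[ $1 =~ $2 ]]; then echo match; else echo no-match; fi; }

# Glob: "*" on its own stands for any string.
glob_star=$(glob_test "$two_char" 'GM12878_Hs_InSitu_MboI_r*')

# Regex: "r*" means zero or more "r" characters, so the rest is left unmatched.
regex_star=$(regex_test "$two_char" '^GM12878_Hs_InSitu_MboI_r*$')

# Character class: exactly one character must sit between "r" and "_".
class_two=$(glob_test "$two_char" 'GM12878_Hs_InSitu_MboI_r[EFGH12,]_*')
class_one=$(glob_test "$one_char" 'GM12878_Hs_InSitu_MboI_r[EFGH12,]_*')

echo "glob  r*           on rG1: $glob_star"
echo "regex r*           on rG1: $regex_star"
echo "glob  r[EFGH12,]_* on rG1: $class_two"
echo "glob  r[EFGH12,]_* on rF : $class_one"
```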
Eric Renouf
-1

As far as I know, and this article supports me, square brackets don't work as a choice between alternatives but as a character set, so [E1,E2,F,G1,G2,H] is equivalent to exactly one occurrence of [EGHF12,]. You can then read the second pattern as "one character from EGHF12, followed by an underscore", which matches GM12878_Hs_InSitu_MboI_rF_TagDirectory: but not GM12878_Hs_InSitu_MboI_rG1_TagDirectory: (there the r is followed by more than one such character before the underscore).

The first glob works because of the asterisk, which matches whatever the misused [...] leaves over.

A correct expression would be:

ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}* | grep ":"
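
If pipe-separated alternatives are what you want, Bash's extended globbing offers @(a|b), which is a real pattern rather than a text expansion. The sketch below is hedged: it assumes Bash with default options, the rZ alternative is a made-up name that does not exist on disk, and the pattern is held in a variable so that it is only interpreted once extglob is already on.

```shell
#!/usr/bin/env bash
shopt -s extglob   # @(a|b) patterns are not enabled by default

# Scratch directory holding the six directory names from the question.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_TagDirectory

# Brace expansion generates a word for rZ even though nothing matches it,
# so the unmatched pattern is passed through literally.
brace_out=$(echo GM12878_Hs_InSitu_MboI_r{F,G1,Z}_*)

# An extglob pattern is matched against the filesystem as a whole,
# so only names that actually exist come back.
pattern='GM12878_Hs_InSitu_MboI_r@(F|G1|Z)_*'
extglob_out=$(echo $pattern)   # deliberately unquoted so the glob expands

printf '%s\n' "braces : $brace_out"
printf '%s\n' "extglob: $extglob_out"
```

With ls, the leftover literal word from the brace version would show up as a "No such file or directory" error, which is one practical reason to prefer the extglob form when some alternatives may be absent.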
ColOfAbRiX