sed 's/.../...': Is it possible to store subexpressions for later use?

Question

Suppose I have something like this:

echo "bLah BLaH blAH" | sed -r 's/([a-zA-Z ]+)/\L&/; s/\b[a-z]/\u&/g'

This is a fairly typical use of sed: turning a "crazy-case" string into mixed case (first letter uppercase, the rest lowercase).

However, this always affects the WHOLE string. If, for instance, I want to parse "crazy" mp3 filenames in various flavors ($tracknr - $artist - $title vs. $artist - $tracknr - $title), things get far more complicated, because titles are sometimes in foreign languages such as French, and mixed case just looks BUTT-UGLY in French or Italian. That's why I only want to process the string up to some delimiter, e.g. space-dash-space.

Hence, I'd like to chain 's/.../...' expressions to do things step by step. It would be nice, however, to have a way to "store" subexpressions from PREVIOUS expressions, so that I can use the preserved sub-matches as source text in the next sed replace expression.

If you think that works out of the box, you're wrong. You simply CANNOT use '\1' syntax in the second expression after the semicolon to refer to a subexpression of the previous expression. (Of course it works once you have defined a subexpression in the second expression itself, but that possibility is not under consideration here.) In my case, \1 is simply unknown to the parser, and you'll get the error

sed: -e expression #1, char (xx): invalid reference \1 on `s' command's RHS

Is there anything implemented to perform that sort of thing?

Please post your expected output. – Todd A. Jacobs Jun 12 '12 at 21:35
For reference: http://mywiki.wooledge.org/XyProblem – ghoti Jun 13 '12 at 02:33

score 2 · Answer 1 · answered Jun 12 '12 at 21:38

The Problem

You want to uppercase the first letter in each word.

Your Question Makes Your Life Harder Than Necessary

You can store text in the hold space or use sequential and nested expressions to perform multiple operations on a matching pattern. You might even be able to pull some shenanigans with the hold space to re-process lines. Past a certain level of complexity, though, the real question isn't "Can language X do this?" but rather "What language is optimized for this?"
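As a rough illustration of the hold-space idea, here is a minimal sketch that title-cases everything before the last " - " and leaves the title alone. The function name titlecase_prefix_sed and the sample filename are made up for illustration. It assumes GNU sed (for \L, \u and \b) and a title that contains no hyphen.

```shell
# Hypothetical helper: title-case everything before the last " - ".
# Assumes GNU sed and a title without any "-" in it.
titlecase_prefix_sed() {
  sed -E '
    # Lines without the delimiter are printed unchanged.
    / - /!b
    # Save the untouched line in the hold space.
    h
    # Pattern space: drop the final " - title" part, then title-case the rest.
    s/ - [^-]*$//
    s/.*/\L&/
    s/\b[a-z]/\u&/g
    # Swap buffers and reduce the original line to the title alone.
    x
    s/.* - //
    # Swap back, append the title, and restore the delimiter.
    x
    G
    s/\n/ - /
  '
}

printf '%s\n' "07 - dAFT pUNK - la vie en rose" | titlecase_prefix_sed
# prints: 07 - Daft Punk - la vie en rose
```

The hold space does not store a numbered subexpression as such. It stores a second copy of the line, which each later s command can cut down with its own groups.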

If you want to do heavy text-munging with the regex engine that PCRE imitates and track subexpressions through complex logic, Perl is a better option. Any Turing-complete language will do, but one of the backronyms for Perl is "Pathologically Eclectic Rubbish Lister" for a reason.

The Easy GNU sed Solution

You don't need all the complexity you're asking for. Some basic GNU sed extensions will do what you want.

echo "bLah BLaH blAH" |
sed -r 's/(\b[a-zA-Z ]+\b)/\L&/g; s/\b[a-zA-Z ]/\u&/g'

This produces the desired output of uppercasing the first character of each word:

Blah Blah Blah
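If the goal is to stop at a delimiter rather than touch the whole line, one option is to capture every field in a single s command, so that all back-references belong to the same command. The sketch below assumes GNU sed and a "$tracknr - $artist - $title" layout with no hyphen inside the artist. The name case_artist_only is made up. Note that \L\u only capitalizes the first letter of the field, not of every word.

```shell
# Hypothetical helper: change the case of the artist field only.
# \L lowercases until \E; \u uppercases just the next character.
case_artist_only() {
  sed -E 's/^([0-9]+) - ([^-]+) - (.*)$/\1 - \L\u\2\E - \3/'
}

printf '%s\n' "07 - dAFT pUNK - la vie en rose" | case_artist_only
# prints: 07 - Daft punk - la vie en rose
```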

That may work for GNU sed, but it doesn't work in OSX, FreeBSD, NetBSD, Solaris, etc. — ghoti, Jun 12 '12 at 21:42
`sed` is Turing complete :D +1 for mentioning the _hold_ space — c00kiemon5ter, Jun 12 '12 at 21:45

c00kiemon5ter · Answer 2 · 2012-06-12T22:09:52.203

2

Assuming @CodeGnome got it right, and what you want is

You want to uppercase the first letter in each word.

you can use this alternative (which still relies on GNU-isms: \L, \U, and \?):

sed 's;\(.\)\([^ ]*\) \?;\U\1\L\2 ;g'

your example:

$ echo "bLah BLaH blAH" | sed 's;\(.\)\([^ ]*\) \?;\U\1\L\2 ;g'
Blah Blah Blah
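One detail worth knowing: the command above re-inserts a space after every word, so the output ends with a trailing space. A possible variant that matches only runs of non-space characters avoids that and no longer needs \?, although \U and \L are still GNU extensions. The function name titlecase_words is hypothetical.

```shell
# Hypothetical helper: title-case each run of non-space characters.
# Spaces are never part of a match, so they pass through untouched.
titlecase_words() {
  sed 's/\([^ ]\)\([^ ]*\)/\U\1\L\2/g'
}

printf '%s\n' "bLah BLaH blAH" | titlecase_words
# prints: Blah Blah Blah
```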

If you're OK with solutions other than sed, you can use awk and avoid GNU-isms entirely (thanks to dualbus on IRC):

awk '{for(i=1;i<=NF;i++){$i=toupper(substr($i,1,1))tolower(substr($i,2))}}1'

example:

$ echo "bLah BLaH blAH" | awk '{for(i=1;i<=NF;i++){$i=toupper(substr($i,1,1))tolower(substr($i,2))}}1'
Blah Blah Blah
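The same awk idea can be limited to the part before the title by splitting on " - " instead of on whitespace. This is only a sketch: the names titlecase_fields and titlecase are invented, and it assumes the title is always the last " - "-separated field.

```shell
# Hypothetical helper: title-case every " - "-separated field except the last.
titlecase_fields() {
  awk -F' - ' -v OFS=' - ' '
    # Title-case each space-separated word of s (n, i, w, out are locals).
    function titlecase(s,   n, i, w, out) {
      n = split(s, w, " ")
      out = ""
      for (i = 1; i <= n; i++)
        out = out (i > 1 ? " " : "") toupper(substr(w[i], 1, 1)) tolower(substr(w[i], 2))
      return out
    }
    # Leave the last field (the title) exactly as it was.
    { for (f = 1; f < NF; f++) $f = titlecase($f); print }
  '
}

printf '%s\n' "07 - dAFT pUNK - la vie en rose" | titlecase_fields
# prints: 07 - Daft Punk - la vie en rose
```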

edited Jun 12 '12 at 22:09

answered Jun 12 '12 at 21:43

c00kiemon5ter

or is `\L` `\U` a gnu-ism ? :S not sure.. will have to look it up – c00kiemon5ter Jun 12 '12 at 21:44
GNU sed says (emphasis mine): "Finally, as a **GNU 'sed' extension**, you can include a special sequence made of a backslash and one of the letters 'L', 'l', 'U', 'u', or 'E'." – Todd A. Jacobs Jun 12 '12 at 21:50
Thanks everybody! Think this can help me out a lot now. – syntaxerror Sep 05 '13 at 17:54

Gilles Quénot · Answer 3 · 2012-06-12T22:18:54.450

A Perl one-liner approach ;)

echo "bLah BLaH blAH" |
    perl -ne '@_ = map { ucfirst lc } split; print join(" ", @_), $/'
Blah Blah Blah

That should work on any Unix, I guess =)

Let me break it down:

perl            # ?! dunno =)
-n              # assume "while (<>) { ... }" loop around program
-e              # one line of program (several -e's allowed, omit programfile)
@_              # default array name
=               # what you expect
map             # take a list as argument, and perform modification. Return a list
{ ucfirst lc }  # modification on each element: lowercase it, then uppercase its first letter
split           # without argument, splits the current line on whitespace (we use -n switch)
;               # end of the first instruction
print           # what you expect
join(" ", @_)   # join the list with single spaces
$/              # by default, a newline (see perldoc perlvar)

Yes, thanks but I think I am going to post another thread in perl section to get this done with "rename" (the command, not the perlfunc). It will contain a question about upper/lowercasing a subexpression. — syntaxerror, Jun 12 '12 at 22:16

score 1 · Answer 4 · answered Jun 12 '12 at 22:06

Or in awk, without the overhead of regexps:

[ghoti@pc ~]$ echo "bLah BLaH blAH" | awk 'BEGIN{RS=" ";ORS=RS} {print toupper(substr($0,1,1)) tolower(substr($0,2))}'
Blah Blah Blah

ghoti
