Substitution of characters limited to part of each input line

Question

I have a file, e.g. Inventory.conf, with lines like:

Int/domain—home.dir=/etc/int

I need to replace / and — before the = but not after. Result should be:

Int_domain_home_dir=/etc/int

I have tried several sed commands but none seem to fit my need.

Welcome to Stack Overflow! Please [edit] your question to show [what you have tried so far](http://whathaveyoutried.com). You should include a [mcve] of the code that you are having problems with, then we can try to help with the specific problem. You should also read [ask]. — Toby Speight, Nov 03 '16 at 17:48

mklement0 · Answer 1 · 2016-11-07T07:34:22.640

You're asking for a sed solution, but an awk solution is simpler and performs better in this case, because you can easily split the line into 2 fields by = and then selectively apply gsub() to only the 1st field in order to replace the characters of interest:

$ awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }' <<< 'Int/domain-home.dir=/etc/int'
Int_domain_home_dir=/etc/int

-F= tells awk to split the input into fields by =, which with the input at hand results in $1 (1st field) containing the first half of the line, before the =, and $2 (2nd field) the 2nd half, after the =; using the -F option sets variable FS, the input field separator.
gsub("[./-]", "_", $1) globally replaces all characters in set [./-] with _ in $1 - i.e., all occurrences of either ., / or - in the 1st field are replaced with a _ each.
print $1 FS $2 prints the result: the modified 1st field ($1), followed by FS (which is =), followed by the (unmodified) 2nd field ($2).
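One caveat with print $1 FS $2: if the value itself contains an =, everything from the second = onward ends up in $3 and later fields and is dropped from the output. A minimal sketch of one way around that, assuming only the text before the first = should be rewritten, is to split the line manually with index() and substr(). The function name normalize_keys is made up for illustration:

```shell
# normalize_keys is a hypothetical helper name, not part of the original answer.
# It rewrites ., / and - only in the text before the first "=" of each line.
normalize_keys() {
  awk '{
    pos = index($0, "=")              # position of the first "=", 0 if none
    if (pos == 0) { print; next }     # no "=": pass the line through unchanged
    key = substr($0, 1, pos - 1)      # text before the first "="
    gsub("[./-]", "_", key)           # same character set as the command above
    print key substr($0, pos)         # "=" and the value are left untouched
  }'
}

printf '%s\n' 'Int/domain-home.dir=/etc/int' 'a.b=c=d/e' | normalize_keys
```

This prints Int_domain_home_dir=/etc/int followed by a_b=c=d/e, so a second = in the value survives.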

Note that I've used ASCII char. - (HYPHEN-MINUS, codepoint 0x2d) in the awk script, even though your sample input contains the Unicode char. — (EM DASH, U+2014, UTF-8 encoding 0xe2 0x80 0x94).
If you really want to match that, simply substitute it in the command above, but note that the awk version on macOS won't handle that properly.

Another option is to use iconv with ASCII transliteration, which translates the em dash into a regular ASCII -:

iconv -f utf-8 -t ascii//translit <<< 'Int/domain—home.dir=/etc/int' |
  awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }'
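If transliterating the whole line is not desirable (it also changes any em dash in the value), a possible alternative is to give the em dash its own gsub() call, where it is matched as a literal string instead of as a member of a bracket expression. This is only a sketch: the assumption that a literal multi-byte string also matches in awk implementations that are not UTF-8 aware has not been tested on macOS, and fix_keys is a hypothetical name:

```shell
# fix_keys is a hypothetical helper name used only for this sketch.
fix_keys() {
  awk -F= '{
    gsub("[./-]", "_", $1)   # ASCII characters, as in the command above
    gsub("—", "_", $1)       # em dash matched as a literal string
    print $1 FS $2
  }'
}

printf '%s\n' 'Int/domain—home.dir=/etc/—int' | fix_keys
```

With an awk that handles this as intended, the em dash before the = becomes _ while the one in the value is left alone.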

perl allows for an elegant solution too:

$ perl -F= -ane '$F[0] =~ tr|-/.|_|; print join("=", @F)' <<<'Int/domain-home.dir=/etc/int'
Int_domain_home_dir=/etc/int

-F=, just like with Awk, tells Perl to use = as the separator when splitting lines into fields.
-ane: a activates field splitting, n processes the input line by line without printing each line automatically, and e tells Perl that the next argument is an expression (command string) to execute.
The fields that each line is split into are stored in array @F, where $F[0] refers to the 1st field.
$F[0] =~ tr|-/.|_| translates (replaces) all occurrences of -, /, and . to _.
print join("=", @F) rebuilds the input line from the fields - with the 1st field now modified - and prints the result.

Depending on the Awk implementation used, this may actually be faster (see below).

That sed isn't the best tool for this job is also reflected in the relative performance of the solutions:

Sample timings from my macOS 10.12 machine (GNU sed 4.2.2, Mawk awk 1.3.4, perl v5.18.2, using input file file, which contains 1 million copies of the sample input line) - take them with a grain of salt, but the ratios of the numbers are of interest; fastest solutions first:

# This answer's awk solution.
# Note: Mawk is much faster here than GNU Awk and BSD Awk.
$ time awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }' file >/dev/null
real    0m0.657s

# This answer's perl solution:
# Note: On macOS, this outperforms the Awk solution when using either
#       GNU Awk or BSD Awk.
$ time perl -F= -ane '$F[0] =~ tr|-/.|_|; print join("=", @F)' file >/dev/null
real    0m1.656s

# Sundeep's perl solution with tr///
$ time perl -pe 's#^[^=]+#$&=~tr|/.-|_|r#e' file >/dev/null
real    0m2.370s

# Sundeep's perl solution with s///
$ time perl -pe 's#^[^=]+#$&=~s|[/.-]|_|gr#e' file >/dev/null
real    0m3.540s

# Cyrus' solution.
$ time sed 'h;s/[^=]*//;x;s/=.*//;s/[/.-]/_/g;G;s/\n//' file >/dev/null
real    0m4.090s

# Kenavoz' solution.
# Note: The 3-byte UTF-8 em dash is NOT included in the char. set,
#       for consistency of comparison with the other solutions.
#       Interestingly, adding the em dash adds another 2 seconds or so.
$ time sed ':a;s/[-/.]\(.*=\)/_\1/;ta' file >/dev/null
real    0m9.036s

As you can see, the awk solution is fastest by far, with the line-internal-loop sed solution predictably performing worst, by a factor of about 14.
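To repeat such timings, an input file with many copies of the sample line is needed. How the file used above was created is not stated; the following is just one way to build a comparable one, with make_sample_file being a hypothetical helper name:

```shell
# make_sample_file is a hypothetical helper: it writes COUNT copies of the
# sample input line to the file named by the second argument.
make_sample_file() {
  awk -v n="$1" 'BEGIN {
    for (i = 0; i < n; i++) print "Int/domain-home.dir=/etc/int"
  }' > "$2"
}

sample=$(mktemp)                     # temporary file for the test input
make_sample_file 1000000 "$sample"   # 1 million copies, as in the timings above
wc -l < "$sample"                    # sanity check: number of lines written
rm -f "$sample"
```

Each command from the timing list can then be run against that file with its output sent to /dev/null; absolute numbers will differ by machine and tool version.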

score 2 · Accepted Answer · edited Sep 23 '17 at 16:24


Sed with a t loop (BRE):

$ sed ':a;s/[-/—.]\(.*=\)/_\1/;ta;' <<< "Int/domain—home.dir=/etc/int"
Int_domain_home_dir=/etc/int

When one of the -/—. characters is found, it is replaced with a _. The text that follows it, up to and including the =, is captured and output again using a backreference. If the substitution succeeds, the t command loops back to label :a to check for further replacements.

Edit:

If you're under BSD/Mac OSX (thanks @mklement0):

sed -e ':a' -e 's/[-/—.]\(.*=\)/_\1/;ta'
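One thing to be aware of: because .*= is greedy, the loop rewrites everything before the last = on the line, so with a value that itself contains an = (e.g. a.b-c=d/e=f.g) the part between the first and last = is changed too. A sketch of a variant that stops at the first =, assuming ASCII characters only (the em dash is left out here) and using | as the s delimiter so the / needs no special treatment; subst_before_first_eq is a made-up name:

```shell
# subst_before_first_eq is a hypothetical name used only for this sketch.
# ^\([^=]*\) anchors the match so that the replaced character must sit
# before the first "=" on the line.
subst_before_first_eq() {
  sed -e ':a' -e 's|^\([^=]*\)[-/.]\([^=]*=\)|\1_\2|' -e 'ta'
}

printf '%s\n' 'a.b-c=d/e=f.g' | subst_before_first_eq
```

This prints a_b_c=d/e=f.g, whereas the original loop would produce a_b_c=d_e=f.g for the same input. For lines with a single =, both behave the same.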

edited Sep 23 '17 at 16:24

Graham


answered Nov 04 '16 at 06:19

SLePort


Thanks - you guys are amazing! – HoneyBee Nov 04 '16 at 08:44
While this is a clever way to simulate field-based processing in `sed`, it's also a slow one, due to using a line-internal loop, which may matter with large input sets. To make this GNU `sed` command work with BSD/macOS `sed`, use `sed -e ':a' -e 's/[-/—.]\(.*=\)/_\1/;ta'`. – mklement0 Nov 04 '16 at 15:59

@mklement0 You're right, but I'm not sure the OP needs performance with a conf file. He's probably more interested in a syntax he is familiar with. – SLePort Nov 05 '16 at 04:33

@mklement0 Thanks for the BSD syntax, I edited my answer. – SLePort Nov 05 '16 at 04:36

@Kenavoz: Thanks for updating your answer. Clearly, your answer worked for the OP, and that's great, but future readers may have differing requirements, which is why I added my performance note. – mklement0 Nov 05 '16 at 04:40

score 1 · Answer 3 · answered Nov 03 '16 at 19:23


With GNU sed:

echo 'Int/domain—home.dir=/etc/int' | sed 'h;s/[^=]*//;x;s/=.*//;s/[/—.]/_/g;G;s/\n//'

Output:

Int_domain_home_dir=/etc/int

See: man sed. I assume you want to replace dots too.
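For readers unfamiliar with the hold space, the one-liner can be spread over several lines with a comment per command. This is a sketch of the same command sequence, using the ASCII hyphen instead of the em dash and | as the delimiter of the middle s command; key_subst_hold is a hypothetical name:

```shell
# key_subst_hold is a hypothetical name used only for this sketch.
key_subst_hold() {
  sed '
    # save a copy of the whole line in the hold space
    h
    # pattern space: delete the key, leaving "=" and the value
    s/[^=]*//
    # swap the two: pattern space has the whole line, hold space has "=value"
    x
    # pattern space: delete "=" and the value, leaving only the key
    s/=.*//
    # replace the characters of interest in the key
    s|[/.-]|_|g
    # append a newline plus the hold space ("=value") to the key
    G
    # remove that newline to rejoin key and value
    s/\n//
  '
}

printf '%s\n' 'this-is/a.test=indeed-it/is.just-that' | key_subst_hold
```

This prints this_is_a_test=indeed-it/is.just-that. Since the split happens at the first =, any further = in the value is preserved.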

answered Nov 03 '16 at 19:23

Cyrus


tnx for example. It however did not substitute the "-". I have made a test.txt with only one line: sed "h;s/[^=]*//;x;s/=.*//;s/[—./]/_/g;G;s/\n//" test.txt this-is-_a_test=indeed-it/is.just-that – HoneyBee Nov 03 '16 at 22:54
Please disregard that first comment :-) Below is my comment: tnx for example. It however did not substitute the "-" please see below: I have made a test.txt with only one line: cat test.txt this-is/a.test=indeed-it/is.just-that Result: sed "h;s/[^=]*//;x;s/=.*//;s/[—./]/_/g;G;s/\n//" test.txt this-is_a_test=indeed-it/is.just-that Why that "-" is not substituted is beyond me :-( – HoneyBee Nov 03 '16 at 23:04

@HoneyBee, in your question, you have used `—` which is called emdash I think... it is different from `-` – Sundeep Nov 04 '16 at 02:03

Sundeep · Answer 4 · 2016-11-04T04:20:19.583


If a perl solution is okay:

$ echo 'Int/domain-home.dir=/etc/int' | perl -pe 's#^[^=]+#$&=~s|[/.-]|_|gr#e'
Int_domain_home_dir=/etc/int

^[^=]+ string matching from start of line up to but not including the first occurrence of =
$&=~s|[/.-]|_|gr perform another substitution on matched string
- replace all / or . or - characters with _
- the r modifier returns the modified string instead of modifying the original in place
the e modifier allows using an expression instead of a string in the replacement section
# is used as delimiter to avoid having to escape / inside the character class [/.-]

Also, as suggested by @mklement0, we can use translate instead of inner substitute

$ echo 'Int/domain-home.dir=/etc/int' | perl -pe 's#^[^=]+#$&=~tr|/.-|_|r#e'
Int_domain_home_dir=/etc/int

Note that I've changed the sample input: - is used instead of —, which is what the OP seems to want based on the comments.

edited Nov 04 '16 at 04:20

answered Nov 04 '16 at 02:10

Sundeep



++ for advanced Perl techniques, but, truthfully, even with the explanation it might be difficult to understand. Ever-so-slight simplification: `perl -pe 's/^[^=]+/$&=~tr|\/.-|_|r/e'` – mklement0 Nov 04 '16 at 03:59

@mklement0, thanks.. added the `tr` alternate to answer... personally, I've found `perl` to work very well for most cases, combining best of `sed`, `awk` and more, plus its regex has more options like lookarounds, skip, etc.. and compared to various `sed` and `awk` version differences, using `perl` might be easier for portability – Sundeep Nov 04 '16 at 04:14

Agreed in general - Perl is really powerful, and unless you're bitten by version differences, portable. But its syntax is arcane, and startup cost of the interpreter is non-trivial; depending on your use case, a good old `sed` and `awk` solution may be faster and sometimes also simpler (though you do have to be aware of platform differences there). – mklement0 Nov 04 '16 at 04:30
