I want to replace any alphanumeric string that is exactly 14 characters long and delimited by word boundaries. The string must contain at least one capital letter and one digit. I believe I need positive lookaheads for the capital letter and the digit, and I am fairly sure the regex pattern itself is right. What I don't understand is why sed is not matching. I have validated the pattern with online tools such as regexpal, and there it matches the string as I expect.

Here is the regex and sed command I'm using.

\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b

The sed command I'm testing with is

echo "asdfASDF1234ds" | sed 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g'

I would expect this to match on the echoed string.

tripleee
herkimer

3 Answers


sed understands only a limited form of regex: POSIX basic regular expressions, or extended ones with -E. Neither has lookahead.
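
A quick way to see this, as a sketch assuming sed's default basic-regex mode (no -E): there `(`, `?`, `=` and `{14}` are ordinary literal characters, so the pattern from the question can only match input that literally contains the text `(?=`.

```shell
# Without -E, sed reads "(?=" as three literal characters, not as a lookahead.
printf '%s\n' 'x(?=y' | sed 's/(?=/HIT/'
# prints: xHITy

# The command from the question therefore finds nothing to replace
# and prints its input unchanged.
echo "asdfASDF1234ds" | sed 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g'
# prints: asdfASDF1234ds
```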

Using a tool with more powerful regex support is the simple solution.

If you must use sed, you could do something like:

$ sed '
    # mark delimiters
    s/[^a-zA-Z0-9]\{1,\}/\n&\n/g
    s/^[^\n]/\n&/
    s/[^\n]$/&\n/

    # mark 14-character candidates
    s/\n[a-zA-Z0-9]\{14\}\n/\n&\n/g

    # mark if candidate contains capital
    s/\n\n[^\n]*[A-Z][^\n]*\n\n/\n&\n/g

    # check for a digit; if found, replace
    s/\n\n\n[^\n]*[0-9][^\n]*\n\n\n/NEW_STRING/g

    # remove marks
    s/\n//g
' <<'EOD'
a234567890123n
,a234567890123n,
xx,a234567890123n,yy
a23456789A123n
XX,a23456789A123n,YY
xx,a23456789A1234n,yy
EOD
a234567890123n
,a234567890123n,
xx,a234567890123n,yy
NEW_STRING
XX,NEW_STRING,YY
xx,a23456789A1234n,yy
$
jhnc

This might work for you (GNU sed):

sed -E 's/\<[A-Za-z0-9]{14}\>/\n&\n/
        s/\n.*(([A-Z].*[0-9])|([0-9].*[A-Z])).*\n/NEW_STRING/
        s/\n//g' file    

Isolate a 14-character alphanumeric word by delimiting it with newlines.

If the string between the newlines contains at least one uppercase alpha character and at least one numeric character, replace the string and its delimiters by NEW_STRING.

Remove the delimiters.

Or, if a line may contain multiple such strings, perhaps:

sed -E 's/\b/\n/g
        s#.*#echo "&"|sed -E "/^[a-z0-9]{14}$/I{/[A-Z]/{/[0-9]/s/.*/NEW_STRING/}}"#e
        s/\n//g' file
potong
  • I don't think you can use `.`. Maybe `[^\n]`. eg: `a,A234567890ABCD,,A234567890ABCD,`. Also need `s///g` everywhere – jhnc Dec 25 '22 at 22:13
  • You definitely answered the question as stated. I should rewrite it but the string may be preceded by a colon. I only want to match if there is or isn't a colon but no other non-alnum chars. – herkimer Dec 26 '22 at 19:11
  • @potong my suggestions don't actually fix the problem either. eg: `xx,a234567890123n,a23456789A1234o,a23456789A123n,yy` – jhnc Dec 27 '22 at 16:41
  • @jhnc thank you, I've removed that ill-thought-out solution. – potong Dec 27 '22 at 23:58

sed doesn't support lookaheads, or many many many other modern regex Perlisms. The simple fix is to use Perl.

perl -pe 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g' <<< "asdfASDF1234ds"
tripleee
  • I like this solution. My problem is a little more complicated than I first stated. I want to match on surrounding word boundaries, but not if there is a leading non-alphanumeric character immediately preceding it except for a colon. I have not been able to find the right combination of look ahead and exceptions. Any idea how I could do that as well? – herkimer Dec 23 '22 at 19:23
  • The lookbehind `(?<![^\w:])` sounds like what you seem to be asking. – tripleee Dec 23 '22 at 19:25
  • @tripleee this regex seems slightly broken. consider: `xx,a234567890123n,a23456789A1234o,a23456789A123n,yy` – jhnc Dec 27 '22 at 14:54
  • I can't know what the OP's input looks like. I guess probably replace `.` with `\w` everywhere for better precision. – tripleee Dec 27 '22 at 15:50
  • @tripleee yes, that seems to fix it. Basic issue is that the lookahead is not constrained to the 14 characters of interest. Needs somehow to be told to not search beyond the second `\b`. – jhnc Dec 27 '22 at 16:54