
I have the following code:-

'\u0026' -replace '(\u)(\d{4})', '$$([char]0x$2)'

That will obviously result with:-

$([char]0x0026)

If I make the RegEx substitution into an expandable string with:-

'\u0026' -replace '(\\u)(\d{4})', "$([char]0x`${2})"

Then I will get:-

Unexpected token '0x`$' in expression or statement.

If I simplify things to:-

'\u0026' -replace '(\\u)(\d{4})', "0x`${2}"

Then I can get:-

0x0026

But what I want is to cast that '0x0026' to a char so that '\u0026' is replaced with '&'. However, it seems impossible to pass a RegEx substituted token to a PowerShell subexpression in this way. If you separate the two languages with:-

'\u0026' -replace '(\\u)(\d{4})', "$([char]0x0026) 0x`${2}"

Then the below will result:-

& 0x0026

Which is great as it shows PowerShell subexpressions do work in RegEx substitutions as the converted ampersand shows.
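
One way to see what is happening here, as a minimal sketch: a double-quoted string is expanded by PowerShell before -replace ever runs, so the regex engine only receives the finished text. The variable names $replacement and $result are illustrative only.

```powershell
# The expandable string is evaluated first, on its own.
# $([char]0x0026) becomes '&'; the backtick keeps ${2} literal.
$replacement = "$([char]0x0026) 0x`${2}"

# At this point $replacement is just the text: & 0x${2}
# Only now does the regex engine substitute group 2 into it.
$result = '\u0026' -replace '(\\u)(\d{4})', $replacement
$result   # & 0x0026
```

That ordering is why a subexpression cannot consume ${2}: the captured value does not exist yet when the subexpression runs.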

I am new to RegEx. Have I hit my limit already?

sulligogs
  • *"That will obviously result with: `$([char]0x0026)`"* - nope, that results in an error "The regular expression pattern (\u)(\d{4}) is not valid.". – Tomalak May 08 '21 at 09:01
  • Maybe you should explain what you are trying to do, not how you are trying to do it. (See: [What is the XY problem?](https://xyproblem.info/)) – Tomalak May 08 '21 at 09:01
  • '\u0026' -replace '(\\u)(\d{4})', '$$([char]0x$2)' - I missed out the double backslash to escape the "\u". I wish to convert any instances of "\u" in a file to their corresponding characters. – sulligogs May 08 '21 at 09:08
  • You're explaining *how* you are trying to solve an undisclosed problem. Tell me more about the problem instead of your attempted solution. (My current suspicion is that you are trying to parse JSON.) – Tomalak May 08 '21 at 09:10
  • The problem is that on our intranet there are many webpages with \u embedded, and I want to convert those instances into their respective characters so that I can feed those webpages into a PowerShell script and parse them. – sulligogs May 08 '21 at 09:16
  • Ah, so you're trying to modify HTML source code in files? Could you include a sample of such a file in the question? – Tomalak May 08 '21 at 09:21
  • I completely agree with @Tomalak, the question as presented is an [XY problem](https://en.wikipedia.org/wiki/XY_problem). To get out of this `XY` loop ask yourself **WHY???** (with every definition in the question). As in: why do you want to "`& 0x0026`"? (And add that information to the question.) I guess you simply want to do this: `[Regex]::Unescape('Jack\u0026Jill')`. But even that is a questionable answer as it is usually not required to [unescape](https://learn.microsoft.com/dotnet/api/system.text.regularexpressions.regex.unescape) a regular expression... – iRon May 08 '21 at 09:53
  • iRon - that's what I needed. I suppose the answer to my question, "How do I pass a RegEx token to a PowerShell subexpression in a RegEx substitution?" is that it can't be done. If you leave your answer I'll mark it for you. Thanks again. – sulligogs May 08 '21 at 11:39
  • We're still not solving your actual issue, but merely the symptom of it. There is no reason why `\u0026` would even be in HTML, unless something goes wrong while producing the HTML (then this should be fixed), or it's in JSON strings (then a JSON parser should be used). Replacing these escape sequences via regex is possible, but it does not at all seem like the thing that you actually need. – Tomalak May 08 '21 at 13:37
  • @Tomalak - you're right and there won't be any plans in the future to fix it either, but that's out of my hands. The intranet pages are SharePoint ones and I've read somewhere that certain characters will get escaped in this manner. Apologies for not explaining the background scenario properly, but I really appreciate everyone's input on this. As this is my first post on this site, I'll learn from my mistakes and be clearer next time. – sulligogs May 08 '21 at 15:43

3 Answers


Apparently, you want to unescape an escaped regular expression. You can do this using the .NET [Regex]::Unescape method:

[Regex]::Unescape('Jack\u0026Jill')

Yields:

Jack&Jill
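
A caveat worth checking before relying on this, sketched with a made-up sample string: [Regex]::Unescape converts every regex escape it recognises, not only \uXXXX, so sequences such as \t in the input change too.

```powershell
# Hypothetical sample text containing \u escapes and a \t sequence.
$sample = 'Jack\u0026Jill\tand\u002bmore'

$unescaped = [Regex]::Unescape($sample)

# \u0026 -> &, \u002b -> +, and \t -> an actual tab character.
$unescaped -eq "Jack&Jill`tand+more"   # True
```

If the pages contain backslashes that should stay untouched (Windows paths, for example), a replacement restricted to the \uXXXX pattern may be the safer choice.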
iRon

There's a way in PowerShell 7, where the second argument of -replace can be a scriptblock. Getting the second matching group takes a bit more doing, using $_:

'\u0026' -replace '(\\u)(\d{4})', { $b = $_ }
$b.groups

Groups   : {0, 1, 2}
Success  : True
Name     : 0
Captures : {0}
Index    : 0
Length   : 6
Value    : \u0026

Success  : True
Name     : 1
Captures : {1}
Index    : 0
Length   : 2
Value    : \u

Success  : True
Name     : 2
Captures : {2}
Index    : 2
Length   : 4
Value    : 0026


'\u0026' -replace '(\\u)(\d{4})', { [char][int]('0x' + $_.groups[2]) }

&

Note that \d won't match the hex digits a-f, so a character class such as [0-9a-f] is needed. (The POSIX class [[:xdigit:]] doesn't work in .NET regular expressions.)

'\u002b' -replace '(\\u)([0-9a-f]{4})', { [char][int]('0x' + $_.groups[2]) }

+
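
As a variation on the above, and only a sketch: [Convert]::ToInt32 with base 16 parses the captured digits directly, which avoids building a '0x' string. This assumes a PowerShell version that accepts a scriptblock as the replacement operand; the sample text and variable names are invented.

```powershell
# Invented sample with two escapes, one using an uppercase hex digit.
$text = 'Jack\u0026Jill \u002B more'

# -replace is case-insensitive by default, so [0-9a-f] also matches 'B'.
$decoded = $text -replace '\\u([0-9a-f]{4})', {
    # Group 1 holds the four hex digits; parse them as base 16.
    [char][Convert]::ToInt32($_.Groups[1].Value, 16)
}
$decoded   # Jack&Jill + more
```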
js2010

Use a scriptblock substitution (6.2 and up):

'\u0026' -replace '(\\u)(\d{4})', {"0x$($_.Groups[2].Value)"}

In earlier versions of PowerShell you can do the same by calling [Regex]::Replace():

[regex]::Replace('\u0026', '(\\u)(\d{4})', {param($m) "0x$($m.Groups[2].Value)"})

In both cases, the block will act as a callback for every single match, allowing you to construct the replacement string after getting access to the matched substring(s), but before the substitution takes place:

PS ~> [regex]::Replace('\u0026', '(\\u)(\d{4})', {param($m) "0x$($m.Groups[2].Value)"})
0x0026
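
To go one step further and actually produce the character rather than the '0x0026' text, something like the following sketch may work. The function name Convert-UnicodeEscape is hypothetical, and the pattern is narrowed to a single capture group of four hex digits.

```powershell
function Convert-UnicodeEscape {
    param([string]$Text)

    # [regex]::Replace is case-sensitive, so list both a-f and A-F.
    [regex]::Replace($Text, '\\u([0-9a-fA-F]{4})', {
        param($m)
        # Parse the four hex digits and return the matching character as a string.
        [string][char][Convert]::ToInt32($m.Groups[1].Value, 16)
    })
}

Convert-UnicodeEscape 'Jack\u0026Jill \u002B more'   # Jack&Jill + more
```

Because it goes through [regex]::Replace rather than the -replace operator, this form does not depend on scriptblock support in -replace.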
Mathias R. Jessen