I need to extract only the numbers from an external file. I am using the following command:

(Get-Content -Path .\log.html) | Select-String -Pattern 'load is'

It then returns:

<tr><td>server-67 load is: 0</td></tr>
<tr><td>server-68 load is: 5875</td></tr>
<tr><td>server-69 load is: 6077</td></tr>
<tr><td>server-70 load is: 6072</td></tr>
<tr><td>server-71 load is: 5846</td></tr>
<tr><td>server-72 load is: 1900</td></tr>
<tr><td>server-73 load is: 1900</td></tr>

I only need to extract the number portion. How can I do it?

catalin
  • You can adapt Keith Hill's solution combined with a [look-behind](http://www.powershellcookbook.com/recipe/qAxK/appendix-b-regular-expression-reference): `(Get-Content -Path .\log.html) | Select-String -Pattern '(?<=load is: )(\d+)' -All | Select Matches`. Although, does "**only the numbers**" include the server numbers? – sshine Apr 04 '18 at 10:49
  • @SimonShine: A near-duplicate, for sure, but there's a twist (matching via surrounding in-line context) that makes it a distinct question. Note that `Select Matches` outputs _custom_ objects with a `.Matches` property, which is not the same as outputting just the captured numbers. Also, there is no need for `-All`, unless you want to capture multiple matches _per line_. – mklement0 Apr 04 '18 at 12:02

2 Answers

What distinguishes this question from the near-duplicate "How do I return only the matching regular expression when I select-string(grep) in PowerShell?" is that the substrings of interest must be located via surrounding in-line context that should not itself be part of the match:

PS> Select-String '(?<=load is: )\d+' .\log.html | ForEach-Object { $_.Matches[0].Value }
0
5875
6077
6072
5846
1900
1900

If you want to output actual numbers rather than strings, place a cast such as [int] before $_.Matches[0].Value to convert each text result to an integer.
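
As a minimal sketch of the [int] cast, the snippet below pipes a few sample lines (copied from the question's output and standing in for .\log.html) through the same regex. The variable names $lines, $loads and $total are illustrative assumptions, not part of the original answer.

```powershell
# Sample lines copied from the question's output; they stand in for .\log.html.
$lines = @(
    '<tr><td>server-67 load is: 0</td></tr>'
    '<tr><td>server-68 load is: 5875</td></tr>'
    '<tr><td>server-69 load is: 6077</td></tr>'
)

# Cast each matched substring to [int] so the results are numbers, not strings.
$loads = $lines |
    Select-String '(?<=load is: )\d+' |
    ForEach-Object { [int] $_.Matches[0].Value }

# Numeric results can be aggregated directly, e.g. summed.
$total = ($loads | Measure-Object -Sum).Sum
```

Without the cast, $loads would hold strings, and operations such as sorting would compare them as text rather than by numeric value.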

  • Select-String can accept file paths directly, so for a single file or a group of files matched by a wildcard expression you generally don't need to pipe from Get-Content.
    (For processing entire directory subtrees, pipe from Get-ChildItem -File -Recurse).

  • Regex '(?<=load is: )\d+' uses a (positive) lookbehind assertion ((?<=...)) to require that 'load is: ' precede the match without including that text in the result; only the \d+ part - a nonempty run of digits - is captured.

  • Select-String outputs [Microsoft.PowerShell.Commands.MatchInfo] instances whose .Matches property contains the results of the regex matching operation; each element's .Value property contains the text the regex matched.

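To make the points above concrete, here is a small sketch that inspects one output object for a single sample line taken from the question. The variable names $line, $info and $match are illustrative assumptions.

```powershell
# One sample line from the question, piped in instead of reading .\log.html.
$line = '<tr><td>server-68 load is: 5875</td></tr>'

# Select-String emits one MatchInfo object per matching input line.
$info = $line | Select-String '(?<=load is: )\d+'
$info.GetType().FullName    # Microsoft.PowerShell.Commands.MatchInfo

# .Line holds the whole input line; .Matches holds the regex match objects.
$info.Line
$match = $info.Matches[0]

# The lookbehind text 'load is: ' is not part of the match itself.
$match.Value                # 5875
```
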

In the case at hand, the lookbehind solution is probably simplest, but an alternative solution is to use a capture group, which is ultimately more flexible:

# Same output as above.
Select-String 'load is: (\d+)' .\log.html | ForEach-Object {$_.Matches[0].Groups[1].Value}

What the capture group (the parenthesized subexpression, (...)) matched is available via the .Groups collection of each match (.Matches[0].Groups): the element at index 0 contains the overall match, element 1 contains what the 1st capture group matched, element 2 the 2nd, and so on.
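
The group indexing can be sketched with two capture groups, one for the server number and one for the load. The regex with two groups, the property names Match, Server and Load, and the variables $lines and $rows are assumptions made for illustration; the sample lines come from the question.

```powershell
# Sample lines copied from the question's output; they stand in for .\log.html.
$lines = @(
    '<tr><td>server-67 load is: 0</td></tr>'
    '<tr><td>server-68 load is: 5875</td></tr>'
)

$rows = $lines |
    Select-String 'server-(\d+) load is: (\d+)' |
    ForEach-Object {
        $groups = $_.Matches[0].Groups
        [pscustomobject]@{
            Match  = $groups[0].Value        # index 0: the overall match
            Server = [int] $groups[1].Value  # index 1: 1st capture group
            Load   = [int] $groups[2].Value  # index 2: 2nd capture group
        }
    }
```

This is where capture groups are more flexible than a lookbehind: several parts of the same line can be extracted in one pass.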

mklement0

Here's one possibility:

(Get-Content -Path .\log.html) |
    Where-Object {$_ -match '^<tr><td>server-(?<Server>\d{1,}) load is: (?<load>\d{1,})</td></tr>$'} |
        ForEach-Object {
            [PsCustomObject]@{"ServerNumber"=$matches.Server;"ServerLoad"=$matches.Load}
        }

This will give you output like this:

ServerNumber ServerLoad
------------ ----------
67           0         
68           5875      
69           6077      
70           6072      
71           5846      
72           1900      
73           1900    
boxdog