
I have formatted text files from other sources; I can't control those sources or ask them to generate a more sensible-for-my-purposes format like CSV. I can look at the header lines of the files to determine the column widths (and names, but they're not at issue here). Once I've done that, I'll have an array of widths. I'd like to be able to split subsequent lines in that file based on the widths I've determined from the header.

Obviously, I can loop through the array of widths and bite off the initial substring of the appropriate length, but I'm hoping there's a more efficient way - for example, if all of the columns were the same width, I could just use -split "(\w{$foo})", where $foo is the variable that contains that width.

Is there, in fact, a more efficient way of doing this?

Example data:

Junction      0122 D150441-4    Ni Po De           210 Na

Column widths: $cols = @(14, 5, 11, 2, 16, 3, 4, 2)

(Note: I don't care about trailing spaces in the chopped-up data; I can manage those later. I'm simply looking to chop the data at this point.)
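
For reference, the chop-in-a-loop approach I describe would look something like this (a minimal sketch of my baseline, not an answer; it assumes nothing beyond the widths above, and takes whatever is left over for a short final column):

$line = 'Junction      0122 D150441-4    Ni Po De           210 Na'
$cols = @(14, 5, 11, 2, 16, 3, 4, 2)

$fields = foreach ($width in $cols) {
    if ($line.Length -gt $width) {
        $line.Substring(0, $width)       # bite off the next column
        $line = $line.Substring($width)  # keep the remainder
    }
    else {
        $line                            # short tail: take what's left
        $line = ''
    }
}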

(At iRon's request, so that he can demonstrate his ConvertFrom-SourceTable, here is a full file that might need to be parsed:)

@SUB-SECTOR: sec_C   SECTOR: reft
#
# Trade routes within the subsector
#
#--------1---------2---------3---------4---------5---------6---
#PlanetName   Loc. UPP Code   B   Notes         Z  PBG Al LRX *
#----------   ---- ---------  - --------------- -  --- -- --- -
Lemente       1907 B897563-B    Ag Ni              824 Na
Zamoran       2108 B674675-A  Q Ag Ni              904 Dr
Jeff Zeitlin

2 Answers


Is there, in fact, a more efficient way of doing this?

If by "more efficient", you mean "something that takes fewer CPU cycles", then yes:

$string = 'Junction      0122 D150441-4    Ni Po De           210 Na'
$cols = @(14, 5, 11, 2, 16, 3, 4, 2)
$substrings = @(
  $cols | Select-Object -SkipLast 1 | ForEach-Object {
    $string.Remove($_)                # emit the first $_ characters
    $string = $string.Substring($_)   # keep the rest for the next width
  }
  $string                            # whatever remains is the last column
)

# $substrings now contains the individual column values

The code above grabs the first n-1 substrings by successively removing them from the front of the working copy of the string.
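
To see the mechanics in isolation (this is plain .NET string behaviour, nothing specific to this snippet):

$s = 'abcdef'
$s.Remove(2)     # 'ab'   - the characters before index 2
$s.Substring(2)  # 'cdef' - the characters from index 2 onward

So each iteration emits one column and shortens the working copy by the same amount.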


If by "more efficient" you mean "less code", you can concatenate your constructed regex patterns and grab all capture groups in one go:

$string = 'Junction      0122 D150441-4    Ni Po De           210 Na'
$cols = @(14, 5, 11, 2, 16, 3, 4, 2)

# generate the regex pattern 
# in this case '(.{14})(.{5})(.{11})(.{2})(.{16})(.{3})(.{4})(.{2})'
$pattern = $cols.ForEach({ "(.{$_})" }) -join ''

# use `-match` and $Matches to grab the individual groups
$substrings = if ($string -match $pattern) {
  $Matches[1..$cols.Length]  # group 0 is the whole match; groups 1..n are the columns
}

# $substrings again holds all our substrings
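
As a sketch of how this scales to a whole file (the file name and the number of header lines to skip are placeholders, not part of the question), you can reuse the same pattern against every data line; a line shorter than the combined widths simply won't match and is skipped:

$pattern = $cols.ForEach({ "(.{$_})" }) -join ''
$rows = Get-Content .\subsector.txt |   # hypothetical input file
    Select-Object -Skip 7 |             # skip header and ruler lines
    ForEach-Object {
        if ($_ -match $pattern) {
            # groups 1..n hold the column values; trailing spaces left intact
            ,@($Matches[1..$cols.Length])
        }
    }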
Mathias R. Jessen
  • I haven't finished analysing these to understand fully _why_ they work, but they do work; thank you! Have some rep! – Jeff Zeitlin Apr 17 '19 at 19:02

ConvertFrom-SourceTable

$Text = @'
@SUB-SECTOR: sec_C   SECTOR: reft
#
# Trade routes within the subsector
#
#--------1---------2---------3---------4---------5---------6---
#PlanetName   Loc. UPP Code   B   Notes         Z  PBG Al LRX *
#----------   ---- ---------  - --------------- -  --- -- --- -
Lemente       1907 B897563-B    Ag Ni              824 Na
Zamoran       2108 B674675-A  Q Ag Ni              904 Dr
'@

You might use the raw here-string as input for the ConvertFrom-SourceTable cmdlet, but if the data is retrieved from a file using Get-Content, $Table will likely be an array of strings (lines):

$Table = $Text -Split "[\r\n]+"
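
For example, if the data lived in a file (the path here is a placeholder), Get-Content returns exactly such an array, one string per line:

$Table = Get-Content .\subsector.txt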

If your header never changes, it is easiest to redefine the header row and the ruler using the -Header and -Ruler parameters:

$Table | Select -Skip 7 | ConvertFrom-SourceTable `
    -Header 'PlanetName    Loc. UPP Code   B   Notes         Z  PBG Al LRX *' `
    -Ruler  '----------    ---- ---------  - --------------- -  --- -- --- -' `
    | Format-Table


PlanetName Loc. UPP Code  B Notes Z PBG Al LRX *
---------- ---- --------  - ----- - --- -- --- -
Lemente    1907 B897563-B   Ag Ni   824 Na
Zamoran    2108 B674675-A Q Ag Ni   904 Dr

(Btw. the -Ruler parameter isn't really required here and could be left out for this specific table)

If the header is different for every table, you might consider automatically reformatting the table a bit: remove the # from the header and ruler lines and replace it with a space:

$Table | Select -Skip 5 |
    ForEach-Object {$_ -Replace '^#', ' '} |
        ConvertFrom-SourceTable | Format-Table

The first column is not completely aligned with the header, but this is straightened out by the data that follows. There is one exception: especially when the input is provided as a stream[1], data that is right-aligned with the header/ruler (usually integers, as in the example below) will by default be type-interpreted by the ConvertFrom-SourceTable cmdlet.

[1] If the input is provided as a pipeline stream (rather than a raw here-string), the ConvertFrom-SourceTable cmdlet functions as a middle-of-the-pipeline cmdlet and immediately releases each object to the next cmdlet. It can therefore only determine column positions and adjustments up to the current line.

ConvertFrom-SourceTable '
 PlanetName Loc. UPP Code  B Notes Z PBG Al LRX *
 ---------- ---- --------  - ----- - --- -- --- -
12345789012 1907 B897563-B   Ag Ni   824 Na
        123 2108 B674675-A Q Ag Ni   904 Dr
' | Format-Table

(Note here that the above table input is exactly the same as the Format-Table output)

In other words, if the field in the first row of the first column is a string that is right-aligned with the header (12 characters, as in the above example), it will produce an error if it can't be interpreted (e.g. if it is not a number). You can avoid this by using the -Literal switch.

Conclusion

I guess this command will do the whole trick using the ConvertFrom-SourceTable cmdlet:

$Table | Select -Skip 5 |
    ForEach-Object {$_ -Replace '^#', ' '} |
        ConvertFrom-SourceTable -Literal | Format-Table

Update (3 May 2019)

I have added a new feature to ConvertFrom-SourceTable which might come in handy for floating tables like this one:

-Floating
By default, introductions to floating tables with a ruler that are not streamed through the pipeline are automatically skipped.
If the -Floating switch is provided for pipeline input, the streaming of objects will start at the ruler (streamed floating tables can't be rulerless).
If floating is explicitly disabled (-Floating:$False), the header is presumed to be on the first line, even if the table is not streamed.

This means that you can simplify the command and no longer have to -Skip to a certain row:

ConvertFrom-SourceTable -Literal ($Table | ForEach-Object {$_ -Replace '^#', ' '})

If you want to stream the data from a (large) input file, you will need to provide the -Floating switch to tell the cmdlet to wait for the ruler:

$Table | ForEach-Object {$_ -Replace '^#', ' '} |
    ConvertFrom-SourceTable -Literal -Floating | Format-Table

Update (6 October 2019)

I have updated the ConvertFrom-SourceTable cmdlet.
Although the -Markdown and -Floating parameters have been deprecated, the cmdlet still supports markdown and floating tables. These features can be enforced by explicitly setting the -HorizontalDash (alias -HDash) and -VerticalDash (alias -VDash) parameters (see Help -Full ConvertFrom-SourceTable).

For this specific question:
If it concerns a raw table (complete text table with lines separated by newlines):

PS C:\> ConvertFrom-SourceTable ($Text -Replace '#', ' ') | Format-Table

PlanetName Loc. UPP Code  B Notes Z PBG Al LRX *
---------- ---- --------  - ----- - --- -- --- -
Lemente    1907 B897563-B   Ag Ni   824 Na
Zamoran    2108 B674675-A Q Ag Ni   904 Dr

And in case you'd like to stream the input (and output), you will need to define where the table starts by explicitly setting the horizontal dash character (-HDash '-'):

PS C:\> $Text -Split "[\r\n]+" | ForEach-Object {$_ -Replace '^#', ' '} | 
        ConvertFrom-SourceTable -HDash '-' | Format-Table

PlanetName Loc. UPP Code  B Notes Z PBG Al LRX *
---------- ---- --------  - ----- - --- -- --- -
Lemente    1907 B897563-B   Ag Ni   824 Na
Zamoran    2108 B674675-A Q Ag Ni   904 Dr
iRon
  • For processing these files - and they're not the only ones, but they're good representatives - I'd have to watch the data to determine how many lines to skip - there may be an arbitrary number of lines between `# Trade routes...` and `#Planet...`. This looks like a useful cmdlet to have in my library, however; thank you! – Jeff Zeitlin Apr 19 '19 at 12:14
  • The cmdlet was mainly focused on automatically recognizing columns, but if there is a clear ruler provided, it is actually possible to recognize a table that starts later, at a specific row in a here-string... This means you do not have to `-Skip` to the table any more, see the **update** in my answer. (Note that the `ForEach-Object {$_ -Replace '^#', ' '}` is still required; it is too specific to incorporate in a general cmdlet.) – iRon May 03 '19 at 10:54