Splitting text in PowerShell using content delimiter as filename

Question

I am trying to split a txt transcription into single files, one for each folio.

The folios in the file are marked as [c. 1r], [c. 1v] ... [c. 7v] and so on.

Using this example, I was able to create a PowerShell script that does the magic with a regex that matches each page delimiter, but I seem totally unable to use the regex to give proper names to the pages. With this code

$InputFile = "input.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
while (($Line = $Reader.ReadLine()) -ne $null) {
    if ($Line -match "\[c\. .*?\]") {
        $OutputFile = "MySplittedFileNumber$a$Matches.txt"
        $a++
    }    
    Add-Content $OutputFile $Line
}

all the files are named MySplittedFileNumber1System.Collections.Hashtable.txt instead of using the match. With "$Matches[0]" I'm told that the variable does not exist or has been filtered by -Exclude.

All my attempts at setting the $regex before executing seem to go nowhere. Can someone point me to how to get the resulting filenames formatted as MySplittedFileNumber[c. 1r].txt?

Using just a partial match such as \[(c\. .*?)\] would be even better, but once I know how to retrieve the match, I bet I can find the solution. I could somehow set the 1r, 1v values in $a, but I'd rather use the ones inside the txt file, since some folios may have been misnumbered in the manuscript and I need to retain this.

Content of original input.txt:

[c. 1r]
Text paragraph
text paragraph
...
Text paragraph
[c. 1v]
Text paragraph
text paragraph
...
Text paragraph
[c. 2r]
Text paragraph
text paragraph
...
Text paragraph

Desired result:

Content of MySplittedFileNumber[c. 1r].txt:

[c. 1r]
Text paragraph
text paragraph
...
Text paragraph

Content of MySplittedFileNumber[c. 1v].txt:

[c. 1v]
Text paragraph
text paragraph
...
Text paragraph

Content of MySplittedFileNumber[c. 2r].txt:

[c. 2r]
Text paragraph
text paragraph
...
Text paragraph

Just my first guess, but did you try: `$OutputFile = "MySplittedFileNumber$a$($Matches[0]).txt"` — Paxz, Jul 19 '18 at 11:37
Please show a meaningful sample of your input and the desired output. — Tomalak, Jul 19 '18 at 11:38
@Tomalak `$Matches` doesn't need to be defined. It holds the value that matched in your last use of `-match`. – Paxz, Jul 19 '18 at 11:39
I updated the question with the examples, thanks for your suggestion @Tomalak – Lila, Jul 19 '18 at 11:51

score 2 · Accepted Answer · answered Jul 19 '18 at 11:58

I tried to reproduce it and with a little change it worked:

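$InputFile = "input.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
# Read line by line; ReadLine() returns $null at the end of the file.
While ($null -ne ($Line = $Reader.ReadLine())) {

    # A folio marker such as [c. 1r] starts a new output file.
    If ($Line -match "\[c\. .*?\]") {
        # $() evaluates $Matches[0] before it is expanded into the string.
        $OutputFile = "MySplittedFileNumber$a$($Matches[0]).txt"
        $a++
    }
    # -LiteralPath keeps the square brackets in the name from being read as wildcards.
    Out-File -LiteralPath "<yourFolder>\$OutputFile" -InputObject $Line -Append
}
$Reader.Close()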

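To index into an array or hashtable inside a double-quoted string, wrap the expression in a subexpression, like this: $($array[number])
To write to the file, you should give the full path and not just the file name. -LiteralPath is used because the square brackets in the file name would otherwise be treated as wildcard characters.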
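
A minimal sketch of the string-expansion behaviour described above. The sample line and the variable names $Plain, $Indexed and $SubExpr are illustrative assumptions, not part of the original script:

```powershell
# Illustrative sample line; any line containing a folio marker would do.
$Line = '[c. 1r]'

if ($Line -match '\[c\. .*?\]') {
    # Only the variable is expanded, so the hashtable turns into its type name.
    $Plain   = "File$Matches.txt"        # FileSystem.Collections.Hashtable.txt
    # The variable is expanded first, then "[0]" follows as literal text.
    $Indexed = "File$Matches[0].txt"     # FileSystem.Collections.Hashtable[0].txt
    # The subexpression operator $() evaluates the index before expansion.
    $SubExpr = "File$($Matches[0]).txt"  # File[c. 1r].txt
}
```

The second form is probably what triggered the error in the question: the resulting name contains [0], which the default -Path parameter tries to interpret as a wildcard pattern.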

Paxz


Thank you, it is working, but "Out-File -LiteralPath" seems to give me some problems with Unicode characters that I did not have with "Add-Content". Any idea? – Lila Jul 19 '18 at 12:15

@Lila Try to play around with the `-Encoding` option of `Out-File`. – Paxz Jul 19 '18 at 12:17
After having some issues, I decided that it was easier for me to convert the original to UTF-8 (it was ANSI at the beginning), and this solved the problem. – Lila Jul 19 '18 at 12:39
@Lila That's also an option :P If the answer was correct, don't forget to mark it as correct ;) – Paxz Jul 19 '18 at 12:43
For those interested in using just a part of the regex, I simply used positive lookbehind and lookahead like this `$Line -match "(?<=\[c\. +).*?(?=\].*?)"` in order to get only the folio number and r-v. Since we are planning to adapt it to several transcriptions, it was well worth the scripting time! – Lila Jul 19 '18 at 14:24
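
As a rough illustration of that last comment, the sketch below runs the lookaround pattern on a sample line and compares it with a capture group. The capture-group variant and the variable names are assumptions added for comparison, not something posted in the thread:

```powershell
# Illustrative sample line.
$Line = '[c. 1r]'

# Lookaround version from the comment: the match itself is only the folio label.
if ($Line -match '(?<=\[c\. +).*?(?=\].*?)') {
    $FromLookaround = $Matches[0]   # 1r
}

# Assumed alternative: a capture group, read back from $Matches[1].
if ($Line -match '\[c\. +(.*?)\]') {
    $FromGroup = $Matches[1]        # 1r
}
```

Either value can then be placed in the file name with the $() subexpression syntax.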

score 0 · Answer 2 · answered Jul 19 '18 at 14:35

From version 3 on, PowerShell's Get-Content cmdlet has the -Raw parameter, which reads a whole file into a single string. You can then split that string into chunks with a regular expression (using a positive lookahead).

The very same regex can be used to extract the section name and insert it into the destination file name.

## Q:\Test\2018\07\19\SO_51421567.ps1
##
$RE = [RegEx]'(?=(\[c\. \d+[rv]\]))'

$Sections = (Get-Content '.\input.txt' -raw) -split $RE -ne ''

ForEach ($Section in $Sections){
    If ($Section -Match $RE){
        $Section | Out-File -LiteralPath ("MySplittedFileNumber{0}.txt" -f $Matches[1])
    }
}
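
To see what the lookahead split produces without touching any files, here is a small sketch on an in-memory string. The sample text and the names $Text and $Names are illustrative assumptions. It uses a lookahead without a capture group for the split, because -split also returns the text of any capture groups in the delimiter pattern:

```powershell
# Illustrative stand-in for the content returned by Get-Content -Raw.
$Text = "[c. 1r]`nfirst page`n[c. 1v]`nsecond page`n"

# Split in front of every folio marker; -ne '' drops the empty leading chunk.
$Sections = $Text -split '(?=\[c\. \d+[rv]\])' -ne ''

# Build the destination name from the marker at the start of each section.
$Names = foreach ($Section in $Sections) {
    if ($Section -match '^\[c\. \d+[rv]\]') {
        'MySplittedFileNumber{0}.txt' -f $Matches[0]
    }
}
# $Names now holds MySplittedFileNumber[c. 1r].txt and MySplittedFileNumber[c. 1v].txt
```

Each section still begins with its own marker line, so writing $Section to the matching name with Out-File -LiteralPath would give one file per folio.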
