I am canonicalizing a dataset of PowerShell scripts. One processing step replaces all variables with X and all string literals with Y so that I can detect and remove near-duplicates.
However, I noticed that after canonicalization many scripts boil down to a lot of Y's and some X's, with barely any other code left. This is not what I anticipated, as there are only a handful of variables and string literals in the scripts.
To find all String Literals I used the command:
$Strings = $AST.FindAll({$args[0] -is [System.Management.Automation.Language.StringConstantExpressionAst]}, $true)
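For reproduction, here is a minimal sketch of how the search above can be run end to end. It assumes the AST comes from `[System.Management.Automation.Language.Parser]::ParseInput`; the `$Source`, `$Tokens`, and `$ParseErrors` names and the one-line sample input are hypothetical stand-ins for my actual pipeline, which reads the scripts from the dataset.

```powershell
# Hypothetical sample input; in the real pipeline each script is read from the dataset.
$Source = 'Describe "Placeholder for Nano tests" -Tag Nano { }'

# Parse the text into an AST. ParseInput reports tokens and parse errors through [ref] parameters.
$Tokens = $null
$ParseErrors = $null
$AST = [System.Management.Automation.Language.Parser]::ParseInput($Source, [ref]$Tokens, [ref]$ParseErrors)

# Collect every StringConstantExpressionAst node; $true also searches nested script blocks.
$Strings = $AST.FindAll({ $args[0] -is [System.Management.Automation.Language.StringConstantExpressionAst] }, $true)

# Print the source text covered by each matched node.
$Strings | ForEach-Object { $_.Extent.Text }
```

Printing `Extent.Text` for each match makes it easy to see exactly which pieces of the script end up being replaced with Y.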
To troubleshoot this, I used ShowPSAst (a PowerShell AST visualization tool) to visualize one sample script where the problem was noticeable.
The original script looks like this:
Describe "Files" -Tag OSX,Linux {
It "is utf-8 encoded" {
$true | Should Be $false
}
It "uses Unix-style line endings" {
$true | Should Be $false
}
It "has a shebang" {
$true | Should Be $false
}
}
Describe "Placeholder for Nano tests" -Tag Nano {
}
After canonicalization I obtain the following:
Y Y -Tag Y,Y {
Y Y {
X | Y Y X
}
Y Y {
X | Y Y X
}
Y Y {
X | Y Y X
}
}
Y Y -Tag Y {
}
An excerpt of the AST visualization for the above script:
Note that the highlighted part in the right panel of the image corresponds to the CommandAst node in the left panel, which has many StringConstantExpressionAst nodes as children. Given these AST nodes, it makes sense that there are so many Y's in my canonical version. What confuses me is why nearly all of the individual tokens in the highlighted code are treated as StringConstantExpressionAst. I would expect only "Placeholder for Nano tests" to be treated as a string literal.
To be precise, I would expect
Describe "Placeholder for Nano tests" -Tag Nano
to be transformed into
Describe Y -Tag Nano
and NOT into
Y Y -Tag Y
I don't use PowerShell much myself and don't know its intricacies, so I apologize if I'm missing something basic. Thanks in advance for any help in understanding this PowerShell behavior.