
I started with Project Gutenberg's "The Complete Works of William Shakespeare by William Shakespeare", a UTF-8 text file available from http://www.gutenberg.org/ebooks/100. In PowerShell, I ran

Get-Content -Tail 50 $filename | Sort-Object -CaseSensitive

which - I believe - piped the last 50 lines (i.e., strings delimited by line breaks) of the file to Sort-Object, which was configured to sort alphabetically with strings beginning with lowercase letters before strings beginning with uppercase letters.

Why is the output in the following image (especially in the P's) not sorting according to the -CaseSensitive switch? What is a solution?

Link to Sort-Output Picture

Ansgar Wiechers
Tom Lever
    Looks correct to me. The -CaseSensitive switch prioritizes lowercase over uppercase if the word is otherwise the same, but it does not put all lowercase letters before all uppercase ones, e.g., pleasant > Pleasant > please > Please > portal > Power. – Drew Jul 06 '18 at 04:01

3 Answers


Note: This answer focuses on the general case of sorting entire strings (by all of their characters, not just by the first one).

You're looking for ordinal sorting, where characters are sorted numerically by their Unicode code point ("ASCII value") and therefore all uppercase letters, as a group, sort before all lowercase letters.

As of Windows PowerShell v5.1 / PowerShell Core v7.0, Sort-Object invariably uses lexical sorting[1] (using the invariant culture by default, but this can be changed with the -Culture parameter), where case-sensitive sorting simply means that the lowercase form of a given letter comes directly before its uppercase form, not all letters collectively; e.g., b sorts before B, but they both come after both a and A (also, the logic is reversed from the ordinal case, where it is uppercase letters that come first):

PS> 'B', 'b', 'A', 'a' | Sort-Object -CaseSensitive
a
A
b
B

There is a workaround, however, which (a) sorts uppercase letters before lowercase ones and (b) comes at the expense of performance:

  • For better performance via direct ordinal sorting you need to use the .NET framework directly - see below, which also offers a solution to sort the lowercase letters first.
  • Enhancing Sort-Object to also support ordinal sorting is being discussed in this GitHub issue.
# PSv4+ syntax
# Note: Uppercase letters come first.
PS> 'B', 'b', 'A', 'a' |
      Sort-Object { -join ([int[]] $_.ToCharArray()).ForEach('ToString', 'x4') } 
A
B
a
b

The solution maps each input string to a string composed of the 4-digit hex. representations of the characters' code points, e.g. 'aB' becomes '00610042', representing code points 0x61 and 0x42; comparing these representations is then equivalent to sorting the string by its characters' code points.
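
To see the key that the workaround compares, the mapping can be run on its own. This is only a small sketch; the variable names $key and $keys are made up for illustration:

```powershell
# Build the sort key for a single string: one 4-digit hex group per character.
$key = -join ([int[]] 'aB'.ToCharArray()).ForEach('ToString', 'x4')
$key   # 00610042

# Keys for single letters: uppercase letters get the smaller keys.
$keys = 'B', 'a' | ForEach-Object { -join ([int[]] $_.ToCharArray()).ForEach('ToString', 'x4') }
$keys  # 0042, then 0061
```

Because every character contributes a fixed-width group, comparing the keys as text gives the same order as comparing the code points themselves.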


Use of .NET for direct, better-performing ordinal sorting:

# Get the last 50 lines as a list.
[Collections.Generic.List[string]] $lines = Get-Content -Tail 50 $filename

# Sort the list in place, using ordinal sorting
$lines.Sort([StringComparer]::Ordinal)

# Output the result.
# Note that uppercase letters come first.
$lines

[StringComparer]::Ordinal returns an object that implements the [System.Collections.IComparer] interface.
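
The comparer can also be called directly, which makes the difference from culture-based comparison visible. A minimal sketch; the variable names are illustrative, and the exact numbers returned may differ between .NET versions, so only the sign matters:

```powershell
# Ordinal: 'B' (0x42) has a smaller code point than 'a' (0x61),
# so the result is negative ('B' sorts first).
$ordinalResult = [StringComparer]::Ordinal.Compare('B', 'a')
$ordinalResult -lt 0   # True

# Culture-based comparison normally orders by letter first, so the
# result is typically positive here ('a' sorts before 'B').
$cultureResult = [StringComparer]::InvariantCulture.Compare('B', 'a')
$cultureResult
```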

Using this solution in a pipeline is possible, but requires sending the array of lines as a single object through the pipeline, which the -ReadCount parameter provides:

Get-Content -Tail 50 $filename -ReadCount 0 | ForEach-Object { 
  ($lines = [Collections.Generic.List[string]] $_).Sort([StringComparer]::Ordinal)
  $lines # output the sorted lines 
}

Note: As stated, this sorts uppercase letters first.


To sort all lowercase letters first, you need to implement custom sorting by way of a [System.Comparison[string]] delegate, which in PowerShell can be implemented as a script block ({ ... }) that accepts two input strings and returns their sorting ranking (-1 (or any negative value) for less-than, 0 for equal, 1 (or any positive value) for greater-than):

$lines.Sort({ param([string]$x, [string]$y)
  # Determine the shorter of the two lengths.
  $count = if ($x.Length -lt $y.Length) { $x.Length } else { $y.Length }
  # Loop over all characters in corresponding positions.
  for ($i = 0; $i -lt $count; ++$i) {
    if ([char]::IsLower($x[$i]) -ne [char]::IsLower($y[$i])) {
      # Sort all lowercase chars. before uppercase ones.
      return (1, -1)[[char]::IsLower($x[$i])]
    } elseif ($x[$i] -ne $y[$i]) { # compare code points (numerically)
      return $x[$i] - $y[$i]
    }
    # So far the two strings compared equal, continue.
  }
  # The strings compared equal in all corresponding character positions,
  # so the difference in length, if any, is the decider (longer strings sort
  # after shorter ones).
  return $x.Length - $y.Length
})
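
As a quick check of the lowercase-first comparison, here is a self-contained sketch that applies the same logic to a small, made-up list. The names $sample and $lowerFirst are hypothetical, and the explicit [Comparison[string]] cast is just one way to pass the script block:

```powershell
# Hypothetical sample data standing in for the lines read from the file.
$sample = [Collections.Generic.List[string]] ('B', 'b', 'A', 'a', 'ab', 'aB')

# Same idea as the comparison above, stored in a variable.
$lowerFirst = [Comparison[string]] {
  param([string] $x, [string] $y)
  $count = [Math]::Min($x.Length, $y.Length)
  for ($i = 0; $i -lt $count; ++$i) {
    $xIsLower = [char]::IsLower($x[$i])
    if ($xIsLower -ne [char]::IsLower($y[$i])) {
      # Exactly one of the two characters is lowercase; it sorts first.
      if ($xIsLower) { return -1 } else { return 1 }
    }
    # Same case category: compare the numeric character codes.
    $cx = [int] $x[$i]
    $cy = [int] $y[$i]
    if ($cx -ne $cy) { return $cx - $cy }
  }
  # Equal in all shared positions: the shorter string sorts first.
  return $x.Length - $y.Length
}

$sample.Sort($lowerFirst)
$sample   # a, ab, aB, b, A, B
```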

Note: For English text, the above should work fine, but in order to support all Unicode text, which may contain surrogate code-unit pairs and differing normalization forms (composed vs. decomposed accented characters), even more work is needed.
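
To illustrate why per-character comparison falls short for such text, here is a small sketch. The variable names are made up, and the code only demonstrates the problem rather than solving it:

```powershell
# The same accented letter in two normalization forms.
$composed   = [string] [char] 0x00E9      # single precomposed character
$decomposed = 'e' + [char] 0x0301         # 'e' followed by a combining acute accent
$composed.Length      # 1
$decomposed.Length    # 2
# Ordinal comparison treats the two forms as different strings.
$sameOrdinal = [string]::CompareOrdinal($composed, $decomposed) -eq 0
$sameOrdinal          # False

# A character outside the Basic Multilingual Plane occupies two [char] values.
$emoji = [char]::ConvertFromUtf32(0x1F600)
$emoji.Length                    # 2
[char]::IsSurrogate($emoji[0])   # True
```

Calling .Normalize() on both strings before comparing is one possible way to deal with the first issue; surrogate pairs would need to be compared as whole code points rather than as individual [char] values.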


[1] On Windows, so-called word sorting is performed by default: "Certain non-alphanumeric characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that coop and co-op appear next to each other in a sorted list."; on Unix-like platforms, string sorting is the default, where no special weights apply to non-alphanumeric chars. - see the docs.

mklement0

One way to get the desired result is to take the first character of each string and cast it to an int. This gives you the character's numeric code (its ASCII code for plain English text), which you can then sort numerically into the desired order.

Get-Content -Tail 50 $filename | Sort-Object -Property @{E={[int]$_[0]};Ascending=$true} 

The -Property parameter of Sort-Object accepts a calculated expression. Here, $_ is the current string (line) in the pipeline, [0] takes the first character of that string, and the [int] cast converts that character to its numeric code; the lines are then sorted by that value in ascending order.
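
The keys that this expression produces can be inspected directly. A small sketch with made-up input strings, assuming strict mode is not enabled (the variable name $firstCharKeys is only for illustration):

```powershell
# Numeric key for the first character of each sample string.
$firstCharKeys = 'Please', 'please', '' | ForEach-Object { [int] $_[0] }
$firstCharKeys   # 80, 112, 0
```

The empty string has no first character, so its key is 0, which is why blank lines end up at the top of the sorted output.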

This provides the following output.

You may wish to trim the whitespace from the output; I'll leave that up to you to decide.

 

















    DONATIONS or determine the status of compliance for any particular state
    Foundation, how to help produce our new eBooks, and how to subscribe to
    Gutenberg-tm eBooks with only a loose network of volunteer support.
    International donations are gratefully accepted, but we cannot make any
    Most people start at our Web site which has the main PG search facility:
    Project Gutenberg-tm eBooks are often created from several printed
    Please check the Project Gutenberg Web pages for current donation
    Professor Michael S. Hart was the originator of the Project Gutenberg-tm
    Section 5. General Information About Project Gutenberg-tm electronic
    This Web site includes information about Project Gutenberg-tm, including
    While we cannot and do not solicit contributions from states where we
    against accepting unsolicited donations from donors in such states who
    approach us with offers to donate.
    concept of a library of electronic works that could be freely shared
    considerable effort, much paperwork and many fees to meet and keep up
    editions, all of which are confirmed as not protected by copyright in
    have not met the solicitation requirements, we know of no prohibition
    how to make donations to the Project Gutenberg Literary Archive
    including checks, online payments and credit card donations. To donate,
    methods and addresses. Donations are accepted in a number of other ways
    necessarily keep eBooks in compliance with any particular paper edition.
    our email newsletter to hear about new eBooks.
    please visit: www.gutenberg.org/donate
    statements concerning tax treatment of donations received from outside
    the United States. U.S. laws alone swamp our small staff.
    the U.S. unless a copyright notice is included. Thus, we do not
    visit www.gutenberg.org/donate
    with anyone. For forty years, he produced and distributed Project
    www.gutenberg.org
    we have not received written confirmation of compliance. To SEND
    with these requirements. We do not solicit donations in locations where
    works.

Update

To sort lowercase first and skip blank lines, I multiply the ASCII code of an uppercase first character by an arbitrary factor (5 here) so that it is numerically higher than the code of any lowercase letter.

In the sample text, no lines start with special characters or punctuation; the code would probably need to be modified to handle those scenarios correctly.

Get-Content -Tail 50 $filename | ? { -not [string]::IsNullOrEmpty($_) } | Sort-Object -Property {
    if($_[0] -cmatch "[A-Z]")
    {
        5*[int]$_[0]
    }
    else
    {
        [int]$_[0]
    } 
}

This will output:

against accepting unsolicited donations from donors in such states who
approach us with offers to donate.
considerable effort, much paperwork and many fees to meet and keep up
concept of a library of electronic works that could be freely shared
editions, all of which are confirmed as not protected by copyright in
how to make donations to the Project Gutenberg Literary Archive
have not met the solicitation requirements, we know of no prohibition
including checks, online payments and credit card donations. To donate,
methods and addresses. Donations are accepted in a number of other ways
necessarily keep eBooks in compliance with any particular paper edition.
our email newsletter to hear about new eBooks.
please visit: www.gutenberg.org/donate
statements concerning tax treatment of donations received from outside
the U.S. unless a copyright notice is included. Thus, we do not
the United States. U.S. laws alone swamp our small staff.
visit www.gutenberg.org/donate
with these requirements. We do not solicit donations in locations where
works.
www.gutenberg.org
with anyone. For forty years, he produced and distributed Project
we have not received written confirmation of compliance. To SEND
DONATIONS or determine the status of compliance for any particular state
Foundation, how to help produce our new eBooks, and how to subscribe to
Gutenberg-tm eBooks with only a loose network of volunteer support.
International donations are gratefully accepted, but we cannot make any
Most people start at our Web site which has the main PG search facility:
Please check the Project Gutenberg Web pages for current donation
Professor Michael S. Hart was the originator of the Project Gutenberg-tm
Project Gutenberg-tm eBooks are often created from several printed
Section 5. General Information About Project Gutenberg-tm electronic
This Web site includes information about Project Gutenberg-tm, including
While we cannot and do not solicit contributions from states where we
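
Since only the first character is compared, lines that start with the same letter come out in no particular order (for example, "considerable" before "concept" above). One possible refinement, sketched here with made-up sample lines, is to pass the whole line as a second sort criterion to break ties; the variable names are hypothetical:

```powershell
# Hypothetical sample lines.
$sampleLines = 'Please check', 'considerable effort', 'please visit', 'concept of'

$sortedLines = $sampleLines | Sort-Object -Property {
    # Primary key: first-character code, with uppercase letters pushed after lowercase ones.
    if ($_[0] -cmatch '[A-Z]') { 5 * [int] $_[0] } else { [int] $_[0] }
}, {
    # Secondary key: the whole line, compared with Sort-Object's default string rules.
    $_
}
$sortedLines
# concept of
# considerable effort
# please visit
# Please check
```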
Jacob
    Your solution worked really well. I simplified it into: echo '.' Get-Content -Tail 10 $filename | Sort-Object -Property {[int]$_[0]} – Tom Lever Jul 15 '18 at 13:59
  • I'm glad it helped, as a side note, if you wanted to swap it so that lowercase is above the uppercase, then I've updated the answer to show a solution I've come up with (there might be a better way to do it). – Jacob Jul 15 '18 at 16:27

Comparing Jacob and mklement0's responses, Jacob's solution has the advantages of being visually simple, being intuitive, using pipelines, and being extendable to sorting by second character of first word, or first character of second word, etc. mklement0's solution has the advantages of being faster and giving me ideas of how to sort lowercase then uppercase.

Below I want to share my extension of Jacob's solution, which sorts by first character of second word. Not particularly useful for the Complete Works of Shakespeare, but very useful for a comma-separated table.

Function Replace-Nulls($line) {

  # Replace an empty line with placeholder text so that it still has a second word.
  if ( !($line) ) {
    $line = [char]0 + " " + [char]0 + " [THIS WAS A LINE OF NULL WHITESPACE]"
  }
  # Give a one-word line a placeholder second word.
  if ( !(($line.Split())[1]) ) {
    $line += " " + [char]8 + " [THIS WAS A LINE WITH ONE WORD AND THE REST NULL WHITESPACE]"
  }

  return $line

} # End Replace-Nulls

echo "."
$cleaned_output = Get-Content -Tail 20 $filename | ForEach-Object { Replace-Nulls $_ }
# Sort by the numeric code of the first character of the second whitespace-separated word.
$cleaned_output | Sort-Object -Property { [int]((($_).Split())[1])[0] }
Tom Lever
  • Unfortunately, your question is ambiguous: your own (unsuccessful) attempt with `Sort-Object` definitely sorts by _all_ characters in the input strings, not just the _first_ one. Sorting by only the first letter strikes me as an exotic use case. Instead of posting this answer, I suggest moving your contrasting of the two other answers into your _question_, and omitting the code snippet altogether, as it is incidental to the question you asked and probably just a distraction to future readers. – mklement0 Jul 15 '18 at 16:21
  • P.S.: My sort-by-all-chars. solutions can be made to work in the pipeline too, but not easily (see my update). Ultimately, only enhancing `Sort-Object` will provide a good solution. – mklement0 Jul 15 '18 at 18:21