
Donald Knuth was once given the task of writing a literate program to compute the word frequencies in a file:

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

Doug McIlroy famously rewrote the 10 pages of Pascal in a few lines of sh:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

As a little exercise, I converted this to PowerShell:

(-split ((Get-Content -Raw test.txt).ToLower() -replace '[^a-zA-Z]',' ')) |
  Group-Object |
  Sort-Object -Property count -Descending |
  Select-Object -First $Args[0] |
  Format-Table count, name

I like that PowerShell combines `sort | uniq -c` into a single `Group-Object`.
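
For example, with a small made-up input (just to illustrate the point; not part of the original exercise):

'the','quick','the','fox','the' | Group-Object | Sort-Object Count -Descending

Count Name  Group
----- ----  -----
    3 the   {the, the, the}
    1 quick {quick}
    1 fox   {fox}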

The first line looks ugly, so I wonder whether it can be written more elegantly. Maybe there is a way to load the file with a regex delimiter somehow?

One obvious way to shorten the code would be to use the aliases, but that does not help readability.

qznc
  • Since your script is working and you aren't getting any errors, I think this would be better suited to Code Review: https://codereview.stackexchange.com/ – I.T Delinquent May 30 '19 at 12:45
  • You could drop the `.ToLower()` and take the uppercase `A-Z` out of the replace, because by default both `-replace` and `Group-Object` are case-insensitive. – Theo May 30 '19 at 12:51
  • NOTE: McIlroy didn't "rewrite" Knuth's solution. He only showed how the same task could be done by reusing standard Unix programs. Naturally, this approach is much slower than Knuth's highly efficient solution. – Andriy Makukha Feb 02 '20 at 20:19

3 Answers


I would do it this way.

PS C:\users\me> Get-Content words.txt
One one
two
two
three,three.
two;two


PS C:\users\me> (Get-Content words.txt) -Split '\W' | Group-Object

Count Name                      Group
----- ----                      -----
    2 One                       {One, one}
    4 two                       {two, two, two, two}
    2 three                     {three, three}
    1                           {}
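
Building on that, one way to get the requested top-n list would be the following sketch (the `-ne ''` drops the empty tokens left over from splitting on punctuation, and the count of 10 is just an example):

(Get-Content words.txt) -Split '\W' -ne '' |
  Group-Object |
  Sort-Object Count -Descending |
  Select-Object -First 10 Count, Name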

EDIT: Some code from Bruce Payette's Windows PowerShell in Action:

# top 10 most frequent words, hash table
$s = gc songlist.txt                      # read the file as an array of lines
$s = [string]::join(" ", $s)              # join the lines into a single string
$words = $s.Split(" `t", [stringsplitoptions]::RemoveEmptyEntries)  # split on spaces and tabs
$uniq = $words | sort -Unique             # unique words (not used further below)
$words | % {$h=@{}} {$h[$_] += 1}         # count occurrences into the hash table $h
$frequency = $h.keys | sort {$h[$_]}      # words sorted by ascending count
-1..-10 | %{ $frequency[$_]+" "+$h[$frequency[$_]]}  # print the 10 most frequent, highest first

# or, with Group-Object
$grouped = $words | group | sort count
$grouped[-1..-10]                         # the 10 largest groups, largest first
js2010
  • As the group content isn't relevant, append `-NoElement` to Group-Object to be a bit more efficient. With real-world text (like the OP's posting) you should try: `(Get-Content words.txt) -Split '\W' -ne '' | Group-Object -NoElement | Where Count -gt 1 | Sort count -desc` –  May 30 '19 at 17:21
  • Why the `Where Count -gt 1`? – qznc May 30 '19 at 18:05

Thanks to js2010 and LotPings for the important hints. To document what is probably the best solution:

$Input -split '\W+' |
  Group-Object -NoElement |
  Sort-Object count -Descending |
  Select-Object -First $Args[0]

Things I learned:

  • `$Input` contains stdin. This is closer to McIlroy's code than using `Get-Content` on some file (see the usage sketch after this list).
  • `-split` can actually take regex delimiters.
  • The `-NoElement` parameter lets me get rid of the `Format-Table` line.
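
For completeness, a sketch of how this could be invoked, assuming the pipeline above is saved as a (hypothetical) script wf.ps1 and we want the top 10 words:

Get-Content test.txt | .\wf.ps1 10

The file contents arrive via the pipeline and end up in $Input, while the 10 lands in $Args[0], mirroring the ${1} in McIlroy's sed ${1}q.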
qznc
  • Way shorter: `-split $input | group -n | sort c* | select -l 1`. The end effect is shorter than Doug's bash variant and more readable. – majkinetor May 31 '19 at 08:47

Windows 10 64-bit. PowerShell 5

How to find the most frequently used whole word (e.g. "the", not the "the" inside "weather") in a text file, regardless of case, and how many times it is used, using PowerShell:

Replace 1.txt with your file.

$z = gc 1.txt -raw
-split $z | group -n | sort c* | select -l 1

Results:

Count Name
----- ----
30    THE
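
Spelled out without the aliases (gc = Get-Content, -raw = -Raw, group -n = Group-Object -NoElement, sort c* = Sort-Object Count, select -l 1 = Select-Object -Last 1), the same pipeline would read roughly:

$z = Get-Content 1.txt -Raw
-split $z | Group-Object -NoElement | Sort-Object Count | Select-Object -Last 1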
somebadhat