1

I am completely new to writing PowerShell scripts. So far I have been using plain batch, as this is required by my company. Inside this batch file I use nested for loops to compare two .txt files. In detail, I want to do the following:

  • File 1 contains lots of strings. Each string is on a separate line, preceded by a number and a semicolon, like so: 658;RMS
  • File 2 is some long text.

The aim is to count the number of occurrences of each string from File 1 in File 2, e.g. RMS is counted 300 times.

Because my previous code has huge runtime drawbacks (File 1 has approx. 400 lines and File 2 has 500,000), I read that Select-String in PowerShell is much more efficient. However, after reading some tutorials, it is still not clear to me how to proceed, apart from the fact that I have to run the PowerShell code inside my .bat. My biggest problem is that I am not sure how and where to place my 'variables', i.e. the two input files 1 and 2.

So far I was testing the Select-String method like this:

powershell -command "& {Select-String -Path *.txt -Pattern "RMS"}"

My assumption would be to make use of piping, so something like this:

powershell -command "& {<<path to file one, should read line by line>> | Select-String -Path File2.txt -Pattern "value of file 1"}"

However, I cannot get this to work. Is PowerShell expecting some kind of PSObject before the first pipe?

Compo
SRel

3 Answers

3

For optimal performance, I would approach this task like so.

  • Read the file with the terms as a CSV (it is a CSV, with a ; delimiter)
  • Read the other file into a string
  • For each term, count how often it can be found in the target string (using .IndexOf())

For example

$data = Import-Csv "file1.txt" -Delimiter ";" -Header ID,Term 
$target = Get-Content "file2.txt" -Raw
$counts = @{}

foreach ($term in $data.Term) {
    $index = -1
    $count = 0
    do {
        $index = $target.IndexOf($term, $index + 1)
        if ($index -gt -1) { $count++ } else { break; }
    } while ($true);
    $counts[$term] = $count
}

$counts 

Notes

  • By default, Import-Csv uses the first line of the input file as the header. Your file has no header line, so -Header supplies the column names. If your file already has a header, you can remove the -Header parameter.
  • Get-Content will read the input file into an array of lines by default. But for this approach, having the entire file as one big string is the right thing - that's what -Raw does.
  • @{} creates an empty hashtable
  • $data.Term will access one column of the CSV
  • .IndexOf() is case sensitive. By default, PowerShell is case-insensitive, but native .NET methods like this one will not change their behavior. This might or might not be what you need - use .ToLower() on the $target and the $term if you don't care about case.
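
As an alternative to the .ToLower() suggestion in the last note, here is a minimal sketch that uses the .NET overload `IndexOf(string, int, StringComparison)` to search without regard to case. The sample terms and text are made up for illustration and stand in for the contents of file1.txt and file2.txt.

```powershell
# Hypothetical sample data standing in for file1.txt / file2.txt
$terms  = 'RMS', 'rm4'
$target = 'abc RMS def rms ghi RM4 jkl Rms'

$counts = @{}
foreach ($term in $terms) {
    $index = -1
    $count = 0
    while ($true) {
        # OrdinalIgnoreCase ignores case without creating lowercase copies of the strings
        $index = $target.IndexOf($term, $index + 1, [StringComparison]::OrdinalIgnoreCase)
        if ($index -lt 0) { break }
        $count++
    }
    $counts[$term] = $count
}

$counts
```

With a 500,000-line input, avoiding a lowercase copy of the whole text may save some memory, though I have not measured the difference.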
Tomalak
  • I tested your approach as well and it's quite fast :). Is there some easy modification so that it only saves terms with a count higher than zero to $counts? Also, I have to modify the search expression with regular expressions so that it only counts exact matches. As I am not familiar with PowerShell, where would be the right place to add this in your code? – SRel May 25 '20 at 16:38
  • *"is there some easy modification so it will only save those terms with a count higher than zero to `$counts`"* - Yes, there is, and I am sure you will find it. It's not difficult. :) -- *"I have to modify the search expression with regular expressions so it only counts exact matches."* - Huh? The above code does only count exact matches. Regular expressions are for situations where you *don't* want exact matches. – Tomalak May 25 '20 at 16:46
  • Oh okay, I forgot to mention that, sorry. My file2 contains several lines. For example, I want to count the occurrences of 'RM4'. The following lines can exist: `123456789 RM4 987654321` -> should be counted as 1. However, the occurrence in this line should not be counted: `12345 RM4.DLL 9876`. So my aim was to enclose the search term in whitespace so that it is not followed by anything else :) – SRel May 25 '20 at 16:57
  • @SRel You can still do that. "Encapsulating the `$term` in spaces" is another super easy modification, but I'm not going to do it for you. I'm not trying to be difficult, but to encourage you to read the code I wrote and to understand what it does. Because as soon as you understand these few lines, you will know what to do. But if I make those changes, I'm not helping you understand. – Tomalak May 25 '20 at 17:01
  • Great, I will try my best. Thank you for all your help, indeed that should not be so difficult – SRel May 25 '20 at 17:04
  • @SRel That's the spirit. – Tomalak May 25 '20 at 17:05
  • Oh okay, the first thing was easy. I understood your code, a very intelligent approach. I replaced `$counts[$term] = $count` with `if ($count -gt 0) {$counts[$term] = $count}` – SRel May 25 '20 at 18:00
  • I have now modified the answer from Mathias by adding a \s. However, I am not sure where I can add a regex in your code. My other approach would be to just modify the search term by adding a whitespace at the end, would this be okay in this case? – SRel May 25 '20 at 18:39
  • @SRel Very good on the first point, that was exactly right. On the second point - you don't need regex. Just add a space `" "` to the front and back of `$term` and you're done. – Tomalak May 25 '20 at 19:04
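
Putting the two modifications from this comment thread together, a minimal sketch could look like the following. The sample terms and text are invented for illustration, and `$needle` is a helper name that does not appear in the answer. One assumption to be aware of: wrapping the term in literal spaces will not match a term at the very start or end of a line, or one delimited by tabs.

```powershell
# Hypothetical sample data standing in for file1.txt / file2.txt
$terms  = 'RM4', 'XYZ'
$target = "123456789 RM4 987654321`r`n12345 RM4.DLL 9876"

$counts = @{}
foreach ($term in $terms) {
    # Surround the term with spaces so that "RM4.DLL" is not counted
    $needle = " $term "
    $index = -1
    $count = 0
    while ($true) {
        $index = $target.IndexOf($needle, $index + 1)
        if ($index -lt 0) { break }
        $count++
    }
    # Only keep terms that were actually found
    if ($count -gt 0) { $counts[$term] = $count }
}

$counts
```

Here only RM4 ends up in `$counts`, because XYZ never occurs and the RM4 in RM4.DLL is followed by a dot rather than a space.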
2

Select-String is useful, but it isn't magic :)

With the performance impact in mind, I would approach it like this:

  • For each line in File2:
    • Test for occurrences of all terms in File1

This way, you only need to read and evaluate File2 once:

# prepare hashtable to keep track of count
$count = @{}

# read terms to search for from file1
$termsToFind = Get-Content .\file1 |ForEach-Object {
  $_ -split ';' |Select -Last 1
}

# loop over lines in file2, count the words we're searching for
Get-Content .\file2 |ForEach-Object {
  foreach($term in $termsToFind){
    # Using `Regex.Matches()` will help us find multiple occurrences of the same term
    $count[$term] += [regex]::Matches($_,"\b$([regex]::Escape($term))\b").Count
  }
}

Now $count will be a hashtable where the key is the term from file1, and the value is the count of each word.

Output to the same format as file1 with:

$count.GetEnumerator() |ForEach-Object { $_.Value,$_.Key -join ';' } |Set-Content output.txt
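
Note that `\b` also matches between a word character and a dot, so a term like RM4 would be counted inside RM4.DLL. If that is not wanted, one possible variation (a sketch, not part of the code above) is to replace the word boundaries with lookarounds that require whitespace or the start/end of the line on both sides. The sample lines are made up for illustration.

```powershell
# Hypothetical sample data standing in for file1 / file2
$termsToFind = 'RM4', 'RMS'
$lines = '123456789 RM4 987654321', '12345 RM4.DLL 9876', 'RMS RMS xRMS'

$count = @{}
foreach ($line in $lines) {
    foreach ($term in $termsToFind) {
        # (?<!\S) = not preceded by a non-whitespace character
        # (?!\S)  = not followed by a non-whitespace character
        $pattern = "(?<!\S)$([regex]::Escape($term))(?!\S)"
        $count[$term] += [regex]::Matches($line, $pattern).Count
    }
}

$count
```

Unlike wrapping the term in literal spaces, the lookarounds also accept a term at the very beginning or end of a line.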
Mathias R. Jessen
  • Nice, but you forgot that the content of `file1` is CSV-like (`658;RMS`), where he only needs the second column. – T-Me May 25 '20 at 13:47
  • @T-Me thanks for spotting it, completely forgot that part :) – Mathias R. Jessen May 25 '20 at 14:20
  • Many thanks @MathiasR.Jessen, I tested the first part where you read in file1 and this works perfectly fine. However, trying to implement the second part results in some errors. Does this occur because I am trying to fit the whole code into one line, as I cannot run external PowerShell scripts? My code looks like this: – SRel May 25 '20 at 14:57
  • `powershell -command "& {$count = @{}; $termsToFind = Get-Content 'ModulID.txt' |ForEach-Object {$_ -split ';' |Select -Last 1}; Get-Content 'TlsTrace.prn' |ForEach-Object {foreach($term in $termsToFind){$count[$term] += [regex]::Matches($_,"\b$([regex]::Escape($term))\b").Count}}}"` I changed the filenames to the real ones, everything else is identical – SRel May 25 '20 at 14:58
  • I receive lots of `expressions missing` errors, however the first part is working – SRel May 25 '20 at 14:59
  • I'd strongly suggest either putting the code in a `.ps1` script file and then running `powershell -file C:\path\to\script.ps1`, or turning it into [an encoded command](https://devblogs.microsoft.com/scripting/powertip-encode-string-and-execute-with-powershell/) – Mathias R. Jessen May 25 '20 at 15:02
  • Moving the code into an external .ps1 script was my first intention, however running PS scripts is blocked by policy and I am not able to override it :( However, I can run simple PS commands inside my batch, they are working – SRel May 25 '20 at 15:36
  • Okay, I will try this. So the purpose of encoded commands is to hide that the batch is executing a PS script? – SRel May 25 '20 at 16:11
  • @SRel No, the purpose is to be able to run a whole script as a one-liner without having to worry about escaping the input script on the command line - i.e. exactly what you're struggling with :) – Mathias R. Jessen May 25 '20 at 16:15
  • Ah, thank you :) I am now able to use the script inside a PowerShell environment for testing and everything seems to work. One last point, as you are using regex: my problem is that the exact phrase, e.g. `RM4`, can also occur in the combination `RM4.dll`, however the latter should not be counted. Is there some way to enclose the search term, for example in whitespace? – SRel May 25 '20 at 17:03
  • @MathiasR.Jessen thanks for your help. As both solutions are working, I decided to accept Tomalak's answer as it was a bit faster than yours. However, many thanks again for your help :) – SRel May 25 '20 at 18:46
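
For the encoded-command route mentioned in the comments, here is a minimal sketch of how the Base64 string can be produced. `-EncodedCommand` expects the script text encoded as UTF-16LE (`[System.Text.Encoding]::Unicode`) and then Base64. The script body below is only a placeholder, and `-NoProfile` is an optional addition of mine.

```powershell
# Placeholder script; in practice this would hold the full counting code.
# A single-quoted here-string keeps $ and " characters literal.
$script = @'
$count = @{}
$count['RM4'] = 1
$count
'@

# -EncodedCommand expects Base64 of the UTF-16LE bytes of the script text
$bytes   = [System.Text.Encoding]::Unicode.GetBytes($script)
$encoded = [Convert]::ToBase64String($bytes)

# Print the line to paste into the .bat file
"powershell -NoProfile -EncodedCommand $encoded"
```

The encoding step is run once in an interactive PowerShell session; the batch file then only contains the generated one-liner, so no quoting or escaping of the script is needed there.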
1

If you check the docs, you can't pipe -pattern to select-string. You can use parentheses to make the output of something become the pattern argument:

powershell select-string -pattern (get-content file1) -path file2    

Because -Pattern is position 0 and -Path is position 1, the parameter names can be omitted. -Pattern can also be an array.

powershell select-string (get-content file1) file2  
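
Since the lines in file1 look like `658;RMS`, the patterns probably need to be reduced to the part after the semicolon first. The following is a rough sketch under that assumption. It uses invented sample data piped in as strings instead of `-Path file2`, and `-SimpleMatch` so the terms are treated as literal text rather than regular expressions. Be aware that this counts matching lines per term, not individual occurrences, and that a line is reported once even if several terms appear in it.

```powershell
# Hypothetical sample data standing in for file1 / file2
$file1Lines = '658;RMS', '659;RM4'
$file2Lines = 'aaa RMS bbb', 'ccc RM4 ddd', 'eee RMS fff', 'nothing here'

# Keep only the part after the last semicolon of each line
$terms = $file1Lines | ForEach-Object { ($_ -split ';')[-1] }

# -Pattern accepts an array; each match records which pattern was found
$groups = $file2Lines |
    Select-String -Pattern $terms -SimpleMatch |
    Group-Object -Property Pattern

$groups | Select-Object Name, Count
```

If exact occurrence counts are required, the loop-based answers above are likely the safer choice.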
js2010