
I have an array of objects, $Posts, each of which contains a Title and a SimTitles field, among a few others. I need to compare each title to every other title and record a similarity score in the SimTitles field. So with 80 $Posts it will need around 6,400 iterations, since each title is scored against all the others.

Apart from the Measure-TitleSimilarity routine, which I believe is optimized, can anyone see a way to improve the speed of this double loop that I am missing?

Edit: I have included the function Measure-TitleSimilarity. I am actually passing arrays of words to the function. The whole topic of quantifying arrays for likeness is fascinating. I have tried Title.ToCharArray(), which pushes the threshold ("magic number") much higher. It can also produce a match between two completely different titles as long as the characters are the same (e.g. 'Mother Teresa' closely matches 'Earthmovers' or 'Thermometer', yet the meanings are clearly different). Cosine similarity is just one method, but it seemed the easiest to process. @mclayton and @bryancook - I see the light with your suggestion, but I can't work out how to track which titles no longer need to be checked for similar words.
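For illustration, a quick sketch of why character arrays mislead, using the titles mentioned above (variable names are just for this example):

```powershell
# 'Mother Teresa' vs 'Thermometer': no words in common, but almost all letters in common,
# so a character-level cosine similarity scores them as near-identical while a
# word-level one scores them as unrelated.
$t1 = 'Mother Teresa'; $t2 = 'Thermometer'

# Word-level dimensions: no overlap, so the cosine similarity is 0.
$sharedWords = $t1.Split(' ') | Where-Object { $t2.Split(' ') -contains $_ }

# Character-level dimensions: heavy overlap, so a char-based cosine scores high.
$sharedChars = $t1.ToLower().ToCharArray() | Sort-Object -Unique |
    Where-Object { $t2.ToLower().ToCharArray() -contains $_ }

"{0} shared words, {1} shared characters" -f @($sharedWords).Count, @($sharedChars).Count
```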

Function Get-SimTitles([psobject]$NewPosts) {

  $CKTitles = $NewPosts.title

  foreach ($Ck in $CkTitles) {
    $NewPosts | & { 
      process { 
        if ((Measure-TitleSimilarity $Ck.split(' ') $_.title.split(' ')) -gt .2) {
          $_.SimTitles = $_.SimTitles + 1 
        } 
      } 
    } 
  }
}

Function Measure-TitleSimilarity
{
## Based on VectorSimilarity by Lee Holmes
## Modified slightly to match this use

[CmdletBinding()]
param(
    [Parameter(Position = 0)]
    $Title1,

    [Parameter(Position = 1)]
    $Title2
)

$allkeys = @($Title1) + @($Title2) |  Sort-Object -Unique

$set1Hash = @{}
$set2Hash = @{}
$setsToProcess = @($Title1, $Set1Hash), @($Title2, $Set2Hash)

foreach($set in $setsToProcess)
{
    $set[0] | Foreach-Object {
         $value = 1 
         $set[1][$_] = $value
    }
}

$dot = 0
$mag1 = 0
$mag2 = 0

foreach($key in $allkeys)
{
    $dot += $set1Hash[$key] * $set2Hash[$key]
    $mag1 +=  ($set1Hash[$key] * $set1Hash[$key])
    $mag2 +=  ($set2Hash[$key] * $set2Hash[$key])
}

$mag1 = [Math]::Sqrt($mag1)
$mag2 = [Math]::Sqrt($mag2)

return [Math]::Round($dot / ($mag1 * $mag2), 3)

}
Spodg
    You don't have to compare _every title_ to _every title_ - instead, you only need to compare all unique pairs of titles, then update both of them at once – Mathias R. Jessen Feb 05 '22 at 18:13
  • Would help to understand the way they're being scored and compared in the title similarity method, because maybe you don't need n(n) comparisons. You might be able to sort the list into a more effective lookup. Also, not sure why you're using the '&' operator here - this is creating a separate scope and might lead to additional overhead – bryanbcook Feb 05 '22 at 18:17
  • Yes - if I understand your comment correctly, it is really 80*79 re-iterations as it does not need to compare itself. 6320 iterations. – Spodg Feb 05 '22 at 18:18
  • I have modified a Cosine-Similarity routine. I take the title and split it into an array of words so it ultimately compares the two word arrays. I tried comparing character arrays but felt it did not produce as accurate a result. As for the & operator - would it be faster to use Where-Object {Measure-TitleSimilarity}? – Spodg Feb 05 '22 at 18:24
  • Script block with a process block, similar to _PowerShell filter_ is a pretty efficient enumerating and filtering technique, certainly faster than `Where-Object` (by far). I wouldn't worry about that. I think it's best if you share the data being compared (a minimal reproducible example) as well as the `Measure-TitleSimilarity` function code – Santiago Squarzon Feb 05 '22 at 18:33
  • 1
    Per @bryanbcook, if you want to avoid comparing every object to every other object you’re going to have to take advantage of the specifics of ```Measure-TitleSimilarity``` - for example if ```A``` and ```B``` are 0.1 “similar”, and ```A``` and ```Z``` are 0.8 similar, perhaps you don’t need to compare ```B``` and ```Z``` to know their similarity is > 0.2. Posting the code for your ```Measure-TitleSimilarity``` function might help. Also, see https://stackoverflow.com/questions/1889065/fast-similarity-detection?rq=1 for a “similar” (but generalised) question. – mclayton Feb 05 '22 at 19:20
  • For example, you could split your titles into word arrays once before you start comparing, and then loop over *those*, rather than splitting them every time inside your ```Measure-TitleSimilarity``` - that would mean only 80 split operations rather than 2 x 80 x 79 = 12,640. – mclayton Feb 05 '22 at 19:32

3 Answers


You can halve the processing time by removing duplicate comparisons. I.e. once you have compared "title1" and "title2", you don't need to compare "title2" and "title1" - you already know the answer. So your inner loop should not start from the beginning of the array.
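A minimal sketch of that triangular loop, assuming $NewPosts is an array and Measure-TitleSimilarity is the function from the question:

```powershell
# Compare each unique pair exactly once and credit both posts at the same time.
# The inner loop starts at $i + 1, so no pair is visited twice and no post is
# compared to itself: 80 posts means 80 * 79 / 2 = 3,160 comparisons.
for ($i = 0; $i -lt $NewPosts.Count - 1; $i++) {
    for ($j = $i + 1; $j -lt $NewPosts.Count; $j++) {
        $score = Measure-TitleSimilarity $NewPosts[$i].Title.Split(' ') $NewPosts[$j].Title.Split(' ')
        if ($score -gt 0.2) {
            $NewPosts[$i].SimTitles++
            $NewPosts[$j].SimTitles++
        }
    }
}
```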

MikeSh
  • Yes - a possible solution, and it would cut the work in half. It also changes the loop from using a pipeline to an array counter. I am going to play with this concept to make sure there is not a performance drop. From the advice above, I still like the idea of uploading all the titles and figuring out a way to only run 80 iterations through the entire pool. I am playing with some keyword matching; I just worry whether I can have an all-title array big enough. – Spodg Feb 06 '22 at 04:38

Partial answer

This includes a couple of the suggestions from the comments:

  • Mathias R. Jessen - "You don't have to compare every title to every title - instead, you only need to compare all unique pairs"

  • my comment - "you could split your titles into word arrays once before you start comparing, and then loop over those, rather than splitting them every time"

$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";

function ConvertTo-WordSets( [psobject] $Posts )
{

    # preprocess each post to break its title into word counts 
    # so we don't need to do it every time we compare 2 posts

    foreach( $post in $Posts )
    {
        $set = new-object PSCustomObject -Property ([ordered] @{
            "Post"   = $post
            "Title"  = $post.Title.Trim()
            "Words"  = $null
            "Counts" = $null
        });
        $set.Words  = $set.Title.Split(" ");
        $set.Counts = $set.Words `
            | group-object `
            | foreach-object `
                -Begin   { $counts = @{} } `
                -Process { $counts.Add($_.Name, $_.Count) } `
                -End     { $counts };
        write-output $set;
    }

}

function Get-SimTitles( [psobject] $NewPosts )
{

    # instead of comparing every object to every object, just compare unique combinations
    # e.g. X compared to Y is the same as Y compared to X so score them both at the same time
    # (and we don't need to compare an object to itself either)

    for( $i = 0; $i -lt $NewPosts.Length; $i++ )
    {
        $left = $NewPosts[$i];
        for( $j = $i + 1; $j -lt $NewPosts.Length; $j++ )
        {
            $right = $NewPosts[$j];
            if ((Measure-TitleSimilarity $left $right) -gt .5)
            {
                $left.Post.SimTitles  = $left.Post.SimTitles + 1;
                $right.Post.SimTitles = $right.Post.SimTitles + 1;
            } 
        } 
    }

}

Function Measure-TitleSimilarity
{
    param
    (
        [Parameter(Position = 0)]
        $Left,
        [Parameter(Position = 1)]   
        $Right
    ) 

    # we can use the pre-processed word counts now

    $allkeys = $Left.Words + $Right.Words | Sort-Object -Unique

    $dot = 0
    $mag1 = 0
    $mag2 = 0

    foreach($key in $allkeys)
    {
        $dot  += $Left.Counts[$key] * $Right.Counts[$key]
        $mag1 += $Left.Counts[$key] * $Left.Counts[$key]
        $mag2 += $Right.Counts[$key] * $Right.Counts[$key]
    }

    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)

    return [Math]::Round($dot / ($mag1 * $mag2), 3)

}

Performance

Neither this nor the original are particularly fast for even moderately sized samples, but this one is about 4 times faster.

# get some test data
$sentences = (Invoke-WebRequest -Uri "https://raw.githubusercontent.com/SteveMansfield/MNREAD-sentences/master/XMNREAD01.txt").Content;
$sentences = $sentences.Trim("`n").Split("`n") | foreach-object { $_.Substring(1, $_.Length - 3) };

$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };
Measure-Command { Get-SimTitles $posts; }   # time the original implementation

# build some test data
$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };

Measure-Command {
    $wordSets = @( ConvertTo-WordSets $Posts );
    Get-SimTitles $wordSets;
}
Size   Original   This one
  10      0.2       0.02
  20      0.4       0.1
  50      1.9       0.5
 100      8.7       1.9
 200     38         9
 500    246        82

(Times in seconds)

mclayton
  • Clearly it is faster, and it tackles the other angle - a better way to look at the titles through the word sets. What I don't understand (and thanks for the contribution) is how the original $posts gets updated. I see where the word set's Post.SimTitles is updated, but I am amazed that it sticks when you use the original $posts. I am missing something about how that works. – Spodg Feb 10 '22 at 03:41
  • It seems as though the word set is comprised of posts, and yet you can update the posts through the word set. I didn't realize you could then pull the posts back out with the changes. – Spodg Feb 10 '22 at 06:17
  • No worries. Btw, I think you can precompute ```$mag1``` in ```ConvertTo-WordSet``` as well for another tiny shaving off the run time - it’s basically the sqrt of the sum of the squares of the word counts in a given set. – mclayton Feb 10 '22 at 07:22
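The precomputed-magnitude suggestion in the last comment could be sketched like this (the Magnitude property name is hypothetical, added inside ConvertTo-WordSets after $set.Counts is built):

```powershell
# Precompute each title's vector magnitude once per post instead of once per pair:
# magnitude = sqrt of the sum of the squares of the word counts in the set.
$set | Add-Member -NotePropertyName "Magnitude" -NotePropertyValue (
    [Math]::Sqrt(($set.Counts.Values | ForEach-Object { $_ * $_ } | Measure-Object -Sum).Sum)
)

# Measure-TitleSimilarity then only needs the dot product per pair:
#   return [Math]::Round($dot / ($Left.Magnitude * $Right.Magnitude), 3)
```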

I tried other ways to measure the similarity of titles including word frequency vs frequency of words in a specific title. I think comparing the titles is a topic for a separate post though. I still like the idea of only looping once through.

@MikeSh - based on your answer this is what I came up with.

Function Get-SimTitles([psobject]$NewPosts) {

  $end = $NewPosts.Count - 1

  for ($i = 0; $i -lt $end; $i++) {

    $k = $i + 1
    $k..$end |
      Where-Object { $NewPosts[$i].source -ne $NewPosts[$_].source } |
      Where-Object { (Measure-TitleSimilarity $NewPosts[$i].title.split(' ') $NewPosts[$_].title.split(' ')) -gt .35 } |
      & { process { $NewPosts[$_].SimTitles += 1; $NewPosts[$i].SimTitles += 1 } }
  }
}
Spodg