Here is my thinking:
- Sort an author's collection of posts by string length (ascending) so that you are working from smaller texts to larger texts.
- Split each post's text on one or more white-space characters, so that you are only handling wholly non-white-space substrings during processing.
- Find matching substrings that occur in each subsequent post versus an ever-narrowing array of substrings (overlaps).
- Group the consecutive matching substrings by comparing their index values; a gap of more than 1 between indexes starts a new group.
- "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
- Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
- Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.
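The grouping step works because array_intersect() preserves the keys of its first argument, so the surviving indexes still describe word positions in the shortest post. Here is a minimal sketch of just that idea; the two sample sentences are made up for illustration and are not taken from the posts below:

```php
<?php
// Hypothetical sample texts (assumption: not part of the real data set).
$first  = preg_split('/\s+/', 'hi there friends -- Jane Doe', 0, PREG_SPLIT_NO_EMPTY);
$second = preg_split('/\s+/', 'there was a reply -- Jane Doe', 0, PREG_SPLIT_NO_EMPTY);

// array_intersect() keeps the keys of $first, i.e. the original word positions.
$overlaps = array_intersect($first, $second);

// Keys that differ by exactly 1 belong to the same run of consecutive words.
$groups = [];
$start = $previous = null;
foreach ($overlaps as $i => $word) {
    if ($start === null || $i - $previous > 1) {
        $start = $i; // a gap in the keys starts a new group
    }
    $previous = $i;
    $groups[$start][] = $word;
}
var_export($groups);
```

Here "there" sits alone at index 1, while "--", "Jane" and "Doe" occupy indexes 3 to 5 and end up together in one group.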
Code:
$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
$posts['Author2'] = ['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
foreach ($posts as $author => $texts) {
    echo "Author: $author\n";

    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b); // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY); // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY)); // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "\n";

    // batch consecutive substrings
    $group = null;
    $consecutives = []; // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i;
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    // collect potential signatures from the first (shortest) text for measurement
    $potential_signatures = []; // clear previous iteration's data
    foreach ($consecutives as $words) {
        // make the words' characters literal using \Q & \E; allow any white-space between words
        if (preg_match_all('/\Q' . implode('\E\s+\Q', $words) . '\E/', $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]); // keep matches from every group
        }
    }
    usort($potential_signatures, function($a, $b) {
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    echo "Assumed Signature: " . ($potential_signatures[0] ?? '(none found)') . "\n\n";
}
Output:
Author: Author1
array (
0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
11 => '@jhsad.sadas.com',
)
array (
11 =>
array (
0 => '@jhsad.sadas.com',
),
)
Assumed Signature: @jhsad.sadas.com
Author: Author2
array (
0 => 'Finally, this is unwanted stuff. This is the
*author\'s* signature.',
1 => 'This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
2 => 'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
)
array (
2 => 'is',
5 => 'This',
6 => 'is',
7 => 'the',
8 => '*author\'s*',
9 => 'signature.',
)
array (
2 =>
array (
0 => 'is',
),
5 =>
array (
0 => 'This',
1 => 'is',
2 => 'the',
3 => '*author\'s*',
4 => 'signature.',
),
)
Assumed Signature: This is the
*author's* signature.
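One edge case to be aware of: \Q...\E makes characters literal for the regex engine, but PHP still looks for the closing pattern delimiter first, so a word containing / (for example a URL in a signature) would end the pattern early. A possible alternative, sketched here with a hypothetical helper name and made-up sample text, is to escape each word with preg_quote() and pass the delimiter as its second argument:

```php
<?php
// Hypothetical helper (assumption: not part of the code above).
// Builds a pattern that matches the given words separated by any white-space.
function buildSignaturePattern(array $words): string
{
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/'); // escapes regex metacharacters and the "/" delimiter
    }, $words);

    return '/' . implode('\s+', $quoted) . '/';
}

// Made-up sample: the signature contains slashes and parentheses.
$pattern = buildSignaturePattern(['Visit', 'http://example.com/me', '(thanks!)']);
preg_match($pattern, "Body text. Visit\nhttp://example.com/me   (thanks!)", $match);
echo $match[0];
```

The matched text keeps its original white-space (the newline and the run of spaces), which is the same behaviour the \s+ joiner gives in the code above.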