
I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1 is: @jhsad.sadas.com


Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2 is:

This is the
 *author's* signature.

Pay particular attention to the fact that there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a URL, a Twitter mention, or any kind of plain text, of any length, containing any sequence of characters, and it can occur at the start, end, or middle of the string.

I am seeking a method that will extract the longest substring that exists in all $texts elements for a single author.

It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.

IDEA: I'm thinking of converting words to vectors and measuring the similarity between the texts. We could use cosine similarity to find the signatures. I think the solution must be something along these lines.

mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.

mickmackusa
mrmrn
  • You need to find `@jhsad.sadas.com` or just confirm the string has it? Are you allowing loose matches, e.g. `@jhsad.sadas.com.uk`? `@jhsad\.sadas\.com\b` would work, or if the domain is a variable use `preg_quote` on it. – chris85 Oct 13 '17 at 11:19
  • @chris85, I want to find an author's signature in his texts. I don't know what it could be or where he will use it. – mrmrn Oct 13 '17 at 11:28
  • If you don't know what it is, then how can you identify it? – chris85 Oct 13 '17 at 11:35
  • @chris85, by some methods like cosine similarity – mrmrn Oct 13 '17 at 11:58
  • To clarify why this page should be reopened, I've slapped together this [demo](http://sandbox.onlinephpfunctions.com/code/31de29d279ff7716c30fac49b7aa94423a4517af) It sure doesn't feel like an efficient method, but I believe it conveys the right message. @mrmm give us an AMEN if this is what you mean and perhaps we can reopen your question (I've already voted to reopen). – mickmackusa Oct 15 '17 at 22:36
  • @WiktorStribiżew The OP has clarified that there is no static signifier about the signature. It is merely a matter of searching for the longest common substring across multiple strings to determine what is the signature. Please consider reopening. – mickmackusa Oct 15 '17 at 22:39
  • After a short ride in the car, I have realized a handful of ways to improve my earlier _napkin&pen_ snippet. However, no matter how clever the process, the result will still be a "best guess" based on substring length which will not be 100% trustworthy. I mean, if the signature is merely: `site.com` or `namaste` then another common and longer substring that exists in all texts like `something` will win on length. – mickmackusa Oct 16 '17 at 00:21
  • @mickmackusa I doubt it makes sense to try and find any solution here. OP does not know what the signature is like. Besides, there is no effort to really solve the issue. It is off-topic and unclear. – Wiktor Stribiżew Oct 30 '17 at 08:07
  • @WiktorStribiżew Can we change the duplicate to Too Broad then? Sorry for the trouble, it just seems inappropriate. – mickmackusa Oct 30 '17 at 08:09
  • @mickmackusa Your reopen vote has been overridden - I guess reviewers decided that the question should stay closed. – Wiktor Stribiżew Oct 30 '17 at 08:19
  • I can agree with it being closed as Too Broad, but your provided link has been proven unhelpful by the OP. You've got the special powers to change the closure right? – mickmackusa Oct 30 '17 at 08:26
  • @mickmackusa, your code is very useful and, at this stage, it is a solution to my question. I would appreciate it if you posted it as an answer for other users. Thanks. – mrmrn Oct 31 '17 at 11:25
  • I cannot post an answer until your question is reopened. – mickmackusa Oct 31 '17 at 11:30
  • @mickmackusa, I just wanted an idea and your code had an insight in it. – mrmrn Nov 06 '17 at 19:10
  • @chris85 To answer your earlier question, the way to identify the signature is to scan the texts by a given author and extract the longest substring and assume it to be the signature. -- It is a "best guess" kind of technique. – mickmackusa Nov 07 '17 at 02:50
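
The "best guess" caveat raised in the comments above can be reproduced in a few lines. This is only a sketch with made-up posts (the strings and the assumption that `namaste` is the real signature are invented for illustration): a longer word that happens to appear in every post wins on length.

```php
<?php
// Made-up posts: the real signature is "namaste", but "something" is also in every post.
$texts = [
    'I forgot something today namaste',
    'namaste there is something else to say',
    'something new namaste',
];

// Collect the words shared by every post.
$common = null;
foreach ($texts as $text) {
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $common = $common === null ? $words : array_intersect($common, $words);
}
$common = array_values(array_unique($common));

// Picking the longest shared word selects the wrong candidate here.
usort($common, function ($a, $b) {
    return strlen($b) <=> strlen($a);
});
echo $common[0], "\n";  // something
```

Any length-based method shares this weakness, so the result may still need a manual check.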

2 Answers

2

You can use preg_match() with a regex to achieve this.

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";

preg_match('/@\S+/', $str, $match);

var_dump($match); // $match[0] holds the first token that starts with "@", here "@jhsad.sadas.com"
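
If the signature is already known (the case raised in the comments on the question), a sketch along these lines could confirm that it appears in every text. The helper name `allContain` and the shortened sample strings are made up for illustration. `preg_quote()` escapes the literal so that the dots are not treated as wildcards, and the `(?!\S)` lookahead is one possible way to reject loose matches such as `@jhsad.sadas.com.uk`.

```php
<?php
// Hypothetical helper: true only when $signature occurs literally in every text.
function allContain(array $texts, string $signature): bool
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    // (?!\S) requires white-space or end of string right after the signature.
    $pattern = '/' . preg_quote($signature, '/') . '(?!\S)/';
    foreach ($texts as $text) {
        if (!preg_match($pattern, $text)) {
            return false;
        }
    }
    return true;
}

$texts = [
    'sadasd @jhsad.sadas.com sdsdADSA sada',
    'jhg @jhsad.sadas.com sfgff fsdfdsf',
];
var_dump(allContain($texts, '@jhsad.sadas.com'));                        // bool(true)
var_dump(allContain(['see @jhsad.sadas.com.uk'], '@jhsad.sadas.com'));   // bool(false)
```

Note that the lookahead also rejects a signature that is immediately followed by punctuation, so it would need loosening for such texts.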
WasteD
  • Here, @jhsad.sadas.com is just an example. I don't know what the real signature of that author is! All I have is some texts from that author, and I know there is a signature in them. – mrmrn Oct 13 '17 at 11:23
  • @chris85 Yeah I changed it now! – WasteD Oct 13 '17 at 11:24
  • @mrmrn But does the signature always start with an @? – WasteD Oct 13 '17 at 11:27
  • @WaseD No, it can be anything. It can begin with @, begin with a phrase, or just be his nickname. – mrmrn Oct 13 '17 at 11:30
  • @mrmrn I think if that's the case then there is no chance of getting the signature. I mean, how could we, if everything is in the same format? – WasteD Oct 13 '17 at 11:31
  • @WaseD, that is why I tagged the question with machine learning. It's not a routine regex problem. – mrmrn Oct 13 '17 at 11:37
  • @mrmrn How on Earth are we going to know what the signature is if you don't know? Is there at least a set of rules that are common? E.g. "could start with a `@`" or "could be the final word on its own line"? If not, I honestly believe you need to restructure your backend so that the _"signatures"_ are all of the same format. – JustCarty Oct 13 '17 at 11:39
  • @mrmrn But it has nothing to do with machine-learning either. – WasteD Oct 13 '17 at 11:44
  • I think that by converting words to vectors and measuring the similarity between the texts, we can use cosine similarity to find the signatures. – mrmrn Oct 13 '17 at 11:51
  • @mrmrn I think I can't help you with this because I know nearly nothing about it. – WasteD Oct 13 '17 at 11:56
  • @WasteD The question has been updated, clarified, and reopened. Please take a moment to update or delete your answer as your post does not offer a solution for the OP's question. – mickmackusa Nov 07 '17 at 02:46
  • @JustCarty Please see the updated question. This process may actually be a step toward saving authors' signatures in a separate way in the backend (assuming the OP doesn't want to just ask the authors to write their signatures into a separate Signature field on a profile form). – mickmackusa Nov 07 '17 at 02:48
2

Here is my thinking:

  1. Sort an author's collection of posts by string length (ascending) so that you are working from smaller texts to larger texts.
  2. Split each post's text on one or more white-space characters, so that you are only handling wholly non-white-space substrings during processing.
  3. Find matching substrings that occur in each subsequent post versus an ever-narrowing array of substrings (overlaps).
  4. Group the consecutive matching substrings by analyzing their index value.
  5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
  6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
  7. Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.

Code: (Demo)

$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

$posts['Author2'] = ['This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
        'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
        'Finally, this is unwanted stuff. This is the
 *author\'s* signature.'];

foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    
    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b);  // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY);  // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY));  // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "\n";
    
    // batch consecutive substrings
    $group = null;
    $consecutives = [];  // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i;
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    
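    $potential_signatures = [];  // reset for each author
    foreach ($consecutives as $words) {
        // build a pattern from the literal words, allowing any white-space between them
        $pattern = '/' . implode('\s+', array_map(function($word) { return preg_quote($word, '/'); }, $words)) . '/';
        // match potential signatures in first text for measurement:
        if (preg_match_all($pattern, $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]);  // keep candidates from every group, not just the last
        }
    }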
    usort($potential_signatures, function($a,$b){
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    
    echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}

Output:

Author: Author1
array (
  0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
  1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
  2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
  11 => '@jhsad.sadas.com',
)
array (
  11 => 
  array (
    0 => '@jhsad.sadas.com',
  ),
)
Assumed Signature: @jhsad.sadas.com

Author: Author2
array (
  0 => 'Finally, this is unwanted stuff. This is the
 *author\'s* signature.',
  1 => 'This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
  2 => 'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
)
array (
  2 => 'is',
  5 => 'This',
  6 => 'is',
  7 => 'the',
  8 => '*author\'s*',
  9 => 'signature.',
)
array (
  2 => 
  array (
    0 => 'is',
  ),
  5 => 
  array (
    0 => 'This',
    1 => 'is',
    2 => 'the',
    3 => '*author\'s*',
    4 => 'signature.',
  ),
)
Assumed Signature: This is the
 *author's* signature.
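
As another way to reach the same kind of best guess, the longest common substring can also be searched character by character instead of word by word. The following is a minimal brute-force sketch, not a tuned implementation: the function name `longestCommonSubstring` and the shortened sample strings are made up for illustration, and it assumes the white-space inside the signature is identical in every text.

```php
<?php
// Hypothetical helper: brute-force longest substring shared by every text.
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // Candidates come from the shortest text, since the result cannot be longer than it.
    usort($texts, function ($a, $b) {
        return strlen($a) <=> strlen($b);
    });
    $shortest = array_shift($texts);
    $length = strlen($shortest);

    // Try the longest candidates first so the first hit is the answer.
    for ($size = $length; $size > 0; --$size) {
        for ($start = 0; $start + $size <= $length; ++$start) {
            $candidate = substr($shortest, $start, $size);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2;  // missing from this text; try the next candidate
                }
            }
            return $candidate;
        }
    }
    return '';
}

$texts = [
    'sadasd @jhsad.sadas.com sdsdADSA sada',
    'hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    'hkgfl @jhsad.sadas.com dsfjdshflkds kg',
];
echo trim(longestCommonSubstring($texts)), "\n";  // @jhsad.sadas.com
```

The nested loops make this roughly cubic in the length of the shortest text, so it is only reasonable for short posts. It also inherits the same limitation as the word-based approach: any longer substring shared by all texts will beat a short signature.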
mickmackusa