You can build a simple "language model" for this purpose: it estimates the probability of a phrase and marks phrases with a low average per-word probability as unusual.
To estimate word probabilities, it can use smoothed word counts.
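Concretely, this is additive (Laplace) smoothing. If N is the total number of word tokens in the corpus, V is the vocabulary size, and delta is a small constant, each word is assigned

    P(w) = (count(w) + delta) / (N + delta * V)

so even a word never seen in training gets a small nonzero probability.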
This is what the model could look like:
import re
import numpy as np
from collections import Counter


class LanguageModel:
    """ A simple model to measure 'unusualness' of sentences.
    delta is a smoothing parameter.
    The larger delta is, the higher is the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta

    def preprocess(self, sentence):
        words = sentence.lower().split()
        return [re.sub(r"[^A-Za-z]+", '', word) for word in words]

    def fit(self, corpus):
        """ Estimate counts from an array of texts """
        self.counter_ = Counter(word
                                for sentence in corpus
                                for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)

    def perplexity(self, sentence):
        """ Calculate negative mean log probability of a word in a sentence.
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to work with unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual) """
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most likely word """
        return self.perplexity(self.counter_.most_common(1)[0][0])
You can train this model and apply it to different sentences.
train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
"At vero eos et accusam et justo duo dolores et ea rebum.",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas", # an "unlikely" phrase
'sed diam nonumy eirmod sanctus sit amet', # a "likely" phrase
]
lm = LanguageModel()
lm.fit(train)
for sent in test:
print(lm.perplexity(sent).round(3), sent)
which prints:
8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet
You can see that "unusualness" is higher for the first phrase than for the second, because the second one is built entirely from words that appear in the training texts.
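If you need scores on a fixed scale, the relative_perplexity method above rescales perplexity into [0, 1], using the perplexities of the most frequent word and of a completely unseen word as the two endpoints:

for sent in test:
    print(lm.relative_perplexity(sent).round(3), sent)

Values close to 1 mean the sentence is about as unusual as a sequence of entirely unseen words.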
If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model I use here to N-gram models (for English, a sensible N is 2 or 3). Alternatively, you can use a recurrent neural net to predict the probability of each word conditioned on all the previous words, but this requires a really huge training corpus.
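For illustration, here is a minimal bigram sketch built on the class above (BigramModel and bigrams_ are names I made up; it applies the same additive smoothing to the probability of a word given its predecessor):

import numpy as np
from collections import Counter

class BigramModel(LanguageModel):
    """ Sketch of a 2-gram extension: score each word given its predecessor. """
    def fit(self, corpus):
        super().fit(corpus)  # unigram counts double as context counts
        self.bigrams_ = Counter()
        for sentence in corpus:
            words = self.preprocess(sentence)
            self.bigrams_.update(zip(words, words[1:]))

    def perplexity(self, sentence):
        words = self.preprocess(sentence)
        log_proba = 0.0
        for w1, w2 in zip(words, words[1:]):
            # P(w2 | w1) with the same additive smoothing as the unigram model
            pair_count = self.bigrams_.get((w1, w2), 0) + self.delta
            context_count = self.counter_.get(w1, 0) + self.vocabulary_size_ * self.delta
            log_proba += np.log(pair_count / context_count)
        return -log_proba / max(len(words) - 1, 1)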
If you work with a highly inflected language, like Turkish, you can use character-level N-grams instead of a word-level model, or preprocess your texts with a lemmatization algorithm from NLTK.
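A character-level variant can reuse the same class by overriding only preprocess, so that overlapping character n-grams play the role of words (again just a sketch; CharNgramModel and its space-padding scheme are my own choices):

import re

class CharNgramModel(LanguageModel):
    """ Sketch: treat overlapping character n-grams as the model's 'words'. """
    def __init__(self, n=3, delta=0.01):
        super().__init__(delta=delta)
        self.n = n

    def preprocess(self, sentence):
        # keep letters and spaces; pad with spaces so word boundaries form n-grams too
        text = ' ' + re.sub(r"[^a-z ]+", '', sentence.lower()) + ' '
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

fit and perplexity then work unchanged, e.g. cm = CharNgramModel(n=3); cm.fit(train); cm.perplexity(test[0]).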