
Let's say we have two signal spaces S1 and S2, each containing hundreds, perhaps thousands, of signals. S1 contains all signals that are sent or received by a given system (plane, car, etc.); S2 contains all signals that are sent or received by software modules of a subsystem inside the system. Each signal has a specific set of dozens of properties like signal name, cycle time, voltage, etc.

Now I want to check whether each signal in S1 has at least one representation in S2, meaning that all properties of a signal in S1 are equal to all properties of a signal in S2. This sounded easy at first: one could iterate through the signals and their properties and check whether there is an equivalent signal somewhere. But it turned out that on both sides (S1 and S2) there can be wrong specifications, so a signal pair that belongs together can't be identified as such.

Example:

K1 = {Name:= CAN_1234_UHV; Voltage:= 0.8 mV; Cycle=100ms}

D1 = {Name:= CAN_1234_UH; Voltage:= 0.8mV; Cycle=100 ms}

A human being can see quite easily that these two signals may very well fit together, although there are some spelling mistakes.

So what I did was devise an algorithm that calculates a string-distance metric for each property, maps the similarity to a probability that this specific property is equal to the same property of the other signal, calculates the average of these probabilities, and classifies the signals as equal if that average reaches a certain threshold.
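
A minimal sketch of that first approach (assuming Python; difflib's ratio stands in here for whatever string-distance metric is used, and the property keys are illustrative):

    from difflib import SequenceMatcher

    PROPERTIES = ["name", "voltage", "cycle"]   # illustrative property keys

    def property_similarity(a, b):
        # Similarity ratio in [0, 1]; 1.0 means the strings are identical.
        return SequenceMatcher(None, a, b).ratio()

    def signals_match(s1, s2, threshold=0.9):
        # Average the per-property similarities and compare to a threshold.
        score = sum(property_similarity(s1[p], s2[p]) for p in PROPERTIES) / len(PROPERTIES)
        return score >= threshold

    k1 = {"name": "CAN_1234_UHV", "voltage": "0.8 mV", "cycle": "100ms"}
    d1 = {"name": "CAN_1234_UH",  "voltage": "0.8mV",  "cycle": "100 ms"}
    print(signals_match(k1, d1))   # True for this near-identical pair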

This yielded terrible results, because two signals could be classified as equal simply because certain properties had values that were very common in the signal space. So the next step would be to weight the properties (the signal name is better suited than the cycle time for identifying a signal).
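
The weighted variant only changes the scoring; the weights below are illustrative guesses, which is exactly the problem:

    # Reuses property_similarity() from the sketch above; the weights
    # are hand-picked guesses, which is the arbitrariness described below.
    WEIGHTS = {"name": 0.7, "voltage": 0.2, "cycle": 0.1}

    def weighted_match(s1, s2, threshold=0.9):
        # Weights sum to 1, so the score stays in [0, 1].
        score = sum(w * property_similarity(s1[p], s2[p]) for p, w in WEIGHTS.items())
        return score >= threshold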

This whole process seems quite arbitrary to me, because I don't really know which probabilities and weights would yield a good result. So I have a feeling that this could be tackled by a machine learning algorithm, since it could derive the probabilities and weights from training data.
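
One way to frame that (a sketch, assuming scikit-learn and the helpers from the sketches above) is classic record linkage: use the per-property similarities as a feature vector and let a classifier learn the weights from manually labelled signal pairs:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Reuses PROPERTIES and property_similarity() from the sketches above.
    def pair_features(s1, s2):
        return [property_similarity(s1[p], s2[p]) for p in PROPERTIES]

    def train_matcher(pairs, labels):
        # pairs:  list of (signal_from_S1, signal_from_S2) tuples
        # labels: 1 if the pair is a true match, 0 otherwise
        X = np.array([pair_features(a, b) for a, b in pairs])
        clf = LogisticRegression().fit(X, np.array(labels))
        return clf   # clf.coef_ holds the learned per-property weights

    # After training, clf.predict_proba(X)[:, 1] yields a match
    # probability per candidate pair, replacing the hand-tuned weights.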

So, in conclusion: would it be feasible to use a machine learning algorithm to identify signals as "similar enough" so that they can be classified as equal? I'm aware that this question can't be answered generally; I'm more interested in "gut feelings" and "nudges in the right direction".

Thanks in advance

JonBlumfeld

1 Answer


Solution 1 - You can use Apache Solr.

  • You can save (index) all your signals in Apache Solr, where each property of a signal will be stored as a Solr field (see the indexing sketch after this list).

    Example:
    K1 = {Name:= CAN_1234_UHV; Voltage:= 0.8 mV; Cycle=100ms}
    D1 = {Name:= CAN_1234_UH; Voltage:= 0.8mV; Cycle=100 ms}
    
K1 and D1 each become a document in Solr; Name, Voltage, and Cycle become Solr fields.
    
  • Then you can use Solr's MoreLikeThis feature to identify similar signals.
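
For example, indexing the two signals above could look like this (a sketch assuming a Solr core named signals at localhost:8983 with string fields id, name, voltage and cycle; adjust the URL and schema to your setup):

    import requests

    docs = [
        {"id": "K1", "name": "CAN_1234_UHV", "voltage": "0.8 mV", "cycle": "100ms"},
        {"id": "D1", "name": "CAN_1234_UH",  "voltage": "0.8mV",  "cycle": "100 ms"},
    ]
    # POST the documents to Solr's JSON update handler and commit.
    requests.post(
        "http://localhost:8983/solr/signals/update?commit=true",
        json=docs,
    ).raise_for_status()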


This yielded terrible results, because two signals could be classified as equal simply because certain properties had values that were very common in the signal space. So the next step would be to weight the properties (the signal name is better suited than the cycle time for identifying a signal).

For this, check mlt.qf below.

Solr provides a number of common parameters for MoreLikeThis that can be tuned to your needs:

  • mlt.fl Specifies the fields to use for similarity. If possible, these should have stored termVectors.
  • mlt.mintf Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.
  • mlt.mindf Specifies the Minimum Document Frequency; words that do not occur in at least this many documents are ignored.
  • mlt.maxdf Specifies the Maximum Document Frequency; words that occur in more than this many documents are ignored.
  • mlt.minwl Sets the minimum word length below which words will be ignored.
  • mlt.maxwl Sets the maximum word length above which words will be ignored.
  • mlt.maxqt Sets the maximum number of query terms that will be included in any generated query.
  • mlt.maxntp Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.
  • mlt.boost Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or "false".
  • mlt.qf Query fields and their boosts using the same format as that used by the DisMaxRequestHandler. These fields must also be specified in mlt.fl.
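
Putting some of these together, a MoreLikeThis request might look like this (a sketch assuming the MoreLikeThis handler is enabled at /mlt for the signals core above; the field names and boosts are illustrative):

    import requests

    params = {
        "q": "id:K1",                           # find signals similar to K1
        "mlt.fl": "name,voltage,cycle",         # fields used for similarity
        "mlt.qf": "name^5 voltage^1 cycle^0.5", # boost name over common fields
        "mlt.mintf": 1,
        "mlt.mindf": 1,
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/signals/mlt", params=params)
    print(resp.json())

Boosting name via mlt.qf counteracts the problem that very common values such as cycle times dominate the similarity.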

Solution 2 - Write your own solution.

You can write a custom solution for this problem using these algorithms:

  • Levenshtein Distance - Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
  • Hamming Distance - In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
  • Smith–Waterman algorithm - The Smith–Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two strings or nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
  • Sørensen–Dice Coefficient - A statistic used to gauge the similarity of two samples; for strings it is commonly computed over the sets of character bigrams.
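
As an illustration, the Sørensen–Dice coefficient over character bigrams takes only a few lines (a sketch; set-based bigrams are used here, a multiset variant is also common):

    def dice_coefficient(a, b):
        # Build the set of character bigrams for each string.
        bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
        x, y = bigrams(a), bigrams(b)
        if not x and not y:
            return 1.0
        return 2 * len(x & y) / (len(x) + len(y))

    print(dice_coefficient("CAN_1234_UHV", "CAN_1234_UH"))  # ~0.95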
Programmer