4

Essentially, I have two strings of equal length, say 'AGGTCT' and 'AGGCCT' for example's sake. I want to compare them position by position and get a readout of where they do not match. Here I would hope to get 1 as the output, because the strings differ at only one position (position 4). If anyone has ideas for the positional comparison code, that would help me a lot to get started.

Thank you!!

Shai
user2180513
  • Yes, I just want the number of mismatches out to apply it for another purpose. – user2180513 May 16 '13 at 12:27
  • Ok. Then Shai's answer is exactly what you want. – jonhopkins May 16 '13 at 12:27
  • I am guessing by your question that you are dealing with DNA sequences that have only 4 chars - A,G,C,T. In this case I would highly recommend representing them as encoded arrays of doubles, and only converting to strings when you need to show them to the user. All of the operations that you need will become much faster and more convenient – Andrey Rubshtein May 16 '13 at 12:58
  • Thanks. That's a good point, gives me something to look up next. I'm a bit new to all of this, so I always appreciate more pointers. – user2180513 May 16 '13 at 13:30
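The numeric encoding suggested in the comments above could be sketched as follows. This is only one possible mapping: the alphabet order 'ACGT' and the variable names are assumptions made for illustration, not something the commenter specified.

```matlab
% Hypothetical encoding: map each base to its index in the alphabet 'ACGT'.
alphabet = 'ACGT';
str1 = 'AGGTCT';

% ismember returns, as its second output, the position of each
% character of str1 within alphabet (0 if the character is absent).
[~, codes] = ismember(str1, alphabet);   % codes is a double array: [1 3 3 4 2 4]

% Convert back to characters only when the sequence must be displayed.
decoded = alphabet(codes);               % 'AGGTCT'
```

Element-wise operators such as `~=` work on the encoded arrays exactly as they do on the original char arrays.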

3 Answers

11

Use the following syntax to get the number of dissimilar characters for strings of equal size:

    sum( str1 ~= str2 )

If you want to be case insensitive, use:

    sum( lower(str1) ~= lower(str2) )

The expression `str1 ~= str2` performs an element-wise (char-by-char) comparison of the two strings, yielding a logical vector of the same size as the strings, with true where the characters differ and false where they match. To get your result, simply sum that vector, which counts the true values (the mismatches).
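As a small worked example using the two sequences from the question (the variable names `mismatch`, `nMismatch` and `positions` are chosen here for illustration), the intermediate logical vector can also be passed to `find` to recover where the mismatches occur:

```matlab
str1 = 'AGGTCT';
str2 = 'AGGCCT';

% Element-wise comparison: true wherever the characters differ.
mismatch = (str1 ~= str2);     % logical row vector [0 0 0 1 0 0]

% Summing the logical vector counts the mismatches.
nMismatch = sum(mismatch);     % 1

% find returns the indices of the true entries, i.e. the mismatch positions.
positions = find(mismatch);    % 4
```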

EDIT: if you want to count the number of matching chars you can:

  1. Use "equal to" == operator (instead of "not-equal to" ~= operator):

    sum( str1 == str2 )
    
  2. Subtract the number of mismatches from the total number of characters:

    numel(str1) - sum( str1 ~= str2 )
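A minimal sketch combining the pieces above, assuming a case-insensitive comparison is wanted. The explicit length check is an addition for illustration: `==` and `~=` on two char arrays of different lengths raise an error (unless one of them is a single character), so it can be clearer to fail early with a descriptive message.

```matlab
str1 = 'AGGTCT';
str2 = 'aggcct';

% Guard against unequal lengths before comparing element-wise.
assert(numel(str1) == numel(str2), 'Inputs must have the same length.');

% Case-insensitive count of matching positions.
nMatch = sum(lower(str1) == lower(str2));             % 5

% Equivalent: total length minus the number of mismatches.
nMatchAlt = numel(str1) - sum(lower(str1) ~= lower(str2));   % 5

nMismatch = numel(str1) - nMatch;                     % 1
```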
    
Eitan T
Shai
  • Yes, I just want the number of mismatches out to apply it for another purpose. Shai your code seems to work perfectly for what I need. Thank you! – user2180513 May 16 '13 at 12:26
  • Say you wanted to sum the 'falses' (matches), would you be able to insert something else in the middle of the two strings? – user2180513 May 16 '13 at 12:29
  • @user2180513 You should look at [relation operators](http://www.mathworks.com/help/matlab/matlab_prog/operators.html). You can compare using `==` – Shai May 16 '13 at 12:35
  • @user2180513 - If this answers your question, then please mark it as correct! (click the check mark) – David K May 16 '13 at 12:45
1

You can compare all the elements of the strings:

    r = all(seq1 == seq2)

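This compares the strings char by char and returns true only if every element of the resulting logical array is true, i.e. the strings are identical. Note that it gives a yes/no answer rather than a count of mismatches. If the strings can have different sizes, you may want to compare the sizes first. An alternative, which returns true when at least one position differs (the logical negation of the result above), is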

    r = any(seq1 ~= seq2)

Another solution is to use strcmp:

    r = strcmp(seq1, seq2)
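To make the difference between these three checks concrete, here is a short sketch on the sequences from the question (the result variable names are illustrative). All three answer "are the strings identical?" rather than "how many positions differ?", and only `strcmp` accepts inputs of different lengths:

```matlab
seq1 = 'AGGTCT';
seq2 = 'AGGCCT';

sameAll = all(seq1 == seq2);    % false: position 4 differs
differs = any(seq1 ~= seq2);    % true:  at least one mismatch
sameStr = strcmp(seq1, seq2);   % false: strings are not identical

% strcmp does not require equal lengths; it simply returns false.
diffLen = strcmp('John', 'Bob');   % false, no error
```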
Cavaz
  • You should use `strcmp()`, as it allows you to compare strings of different lengths, while `seq1 == seq2` assumes they have the same length. I.e. you cannot compare them like `'John' == 'Bob'`, but you can use `strcmp('John', 'Bob')`. – Ufos Mar 14 '16 at 13:49
0

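Just would like to point out that what you are asking for is the Hamming distance (since you ask for alternatives: the article contains links to some). This is already discussed here. In short, the `pdist` function can compute it; note that it belongs to the Statistics Toolbox rather than core MATLAB, and that its 'hamming' option returns the fraction of differing positions, not the count.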
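A rough sketch of the `pdist` route, assuming the Statistics Toolbox (now the Statistics and Machine Learning Toolbox) is installed. `pdist` expects a numeric matrix with one observation per row, so the char arrays are converted with `double` first; the variable names are illustrative.

```matlab
str1 = 'AGGTCT';
str2 = 'AGGCCT';

% Stack the two sequences as rows of a 2-by-6 numeric matrix.
X = double([str1; str2]);

% 'hamming' gives the fraction of coordinates that differ between the rows.
frac = pdist(X, 'hamming');

% Scale by the sequence length to get the mismatch count.
nMismatch = round(frac * numel(str1));
```

For a single pair of strings, `sum(str1 ~= str2)` gives the same count without any toolbox; `pdist` is mainly convenient when many sequences must be compared pairwise.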

Community
bdecaf