Get index of first non-standard English character

Question

I'm trying to process a string and split it into two parts when I find a character that is not in the standard English alphabet. For example, given "This is a stríng with áccents.", I need to know the index of the first (or every) accented character (í).

I think the solution is somewhere between System.Text.Encoding and System.Globalization, but I'm missing something...

The important thing is to know whether a character has an accent and, if possible, to exclude spaces.

void Main()
{
    var str = "This is a stríng with áccents.";
    var strBeforeFirstAccent = str.Substring(0, getIndexOfFirstCharWithAccent(str));
    Console.WriteLine(strBeforeFirstAccent);

}

int getIndexOfFirstCharWithAccent(string str){
    //Process logic
    return 13;
}

Thanks!

What have you tried accomplishing so far? I don't think `return 13` shows any attempt, at all. — Yuval Itzchakov, Jun 05 '15 at 14:56
Sorry, I don't know how to do it. I was hoping for someone who has done it. — CodeArtist, Jun 05 '15 at 14:57
@GrantWinney That's a very naïve way of looking at it. ASCII is a **very limited** character set, and C#/.NET doesn't even *use ASCII* by default. C#/.NET use Unicode in UTF-16 format, which has far more `non-standard English` characters than the ASCII range of 128-165. (How about, for example, an `e` with a diacritical accent over it, which is a completely different representation than ASCII `é` or ASCII 130, which doesn't even work for my PC.) — Der Kommissar, Jun 05 '15 at 15:05
@YuvalItzchakov, From a TDD perspective, `return 13` is the perfect code for just that one string. It only becomes necessary to make the code more complex when a second string is introduced. — David Arno, Jun 05 '15 at 15:14

score 2 · Accepted Answer · answered Jun 05 '15 at 14:59

The regex [^a-zA-Z ] will find characters other than non-accented Roman letters and spaces.

So:

// Requires: using System.Text.RegularExpressions;
var regex = new Regex("[^a-zA-Z ]");
var match = regex.Match("This is a stríng with áccents.");

match.Value will be í, and match.Index will contain its location (13 for this string).
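As a rough sketch of how this could be wrapped into the method the question asks for, the following uses the same pattern and returns -1 when nothing matches. The class and method names (AccentFinder, IndexOfFirst, IndicesOfAll) are made up for illustration and are not part of the original answer. Note that punctuation such as the final period also matches this pattern, as the comments below point out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class AccentFinder
{
    // Same character class as above: anything that is not a-z, A-Z or a space.
    private static readonly Regex NonPlain = new Regex("[^a-zA-Z ]");

    // Returns the index of the first match, or -1 if there is none.
    public static int IndexOfFirst(string text)
    {
        Match match = NonPlain.Match(text);
        return match.Success ? match.Index : -1;
    }

    // Returns the index of every match, in order of appearance.
    public static List<int> IndicesOfAll(string text)
    {
        var indices = new List<int>();
        foreach (Match match in NonPlain.Matches(text))
        {
            indices.Add(match.Index);
        }
        return indices;
    }

    public static void Main()
    {
        const string str = "This is a stríng with áccents.";
        int index = IndexOfFirst(str);
        // Guard against -1 so Substring does not throw when nothing matches.
        Console.WriteLine(index < 0 ? str : str.Substring(0, index));
        Console.WriteLine(string.Join(", ", IndicesOfAll(str)));
    }
}
```

Checking match.Success before reading match.Index matters because a failed match does not give a usable position, and passing a negative value to Substring would throw.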

David Arno

Though I'm not a fan of Regular Expressions, this is a good place for them. Also, note: this only returns the **first** instance ever. If you need others then you'll have to capture them. (That said, this answer still meets the exact criteria of the question, +1 for that.) You may also want to consider adding `0-9` to that negation block, and various other symbols. – Der Kommissar Jun 05 '15 at 15:08
@EBrown, I agree. Regex's have their place, but they get over-used all too often. My concern here with my own answer is that eg `,` in the string would break it. It's an answer, but I'm watching for someone else to post a better answer ;) – David Arno Jun 05 '15 at 15:11
I just finished my non-Regex solution, I think both of them have great merit. (Your Regex solution is fewer lines of code, my non-Regex solution will not improperly tag symbols.) – Der Kommissar Jun 05 '15 at 15:29
`var regex = new Regex("[^\u0000-\u007F]"); var matches=regex.Matches("This is a string with áccents.");` – Robert McKee Jun 05 '15 at 15:40

Der Kommissar · Answer 2 · 2015-06-05T18:17:08.567

Another possible solution (fixed/adapted from Cortright's answer) is to enumerate the string's UTF-16 code units as byte pairs.

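// Requires: using System; using System.Text;
// "\uD852\uDF62" is a surrogate pair (U+24B62); it produces the last two lines of the output below.
const string input = "This is a stríng with áccents \uD852\uDF62.";
byte[] array = Encoding.Unicode.GetBytes(input); // UTF-16LE: two bytes per code unit, low byte first

for (int i = 0; i < array.Length; i += 2)
{
    // Reassemble the 16-bit code unit from its two little-endian bytes.
    int codeUnit = array[i] | (array[i + 1] << 8);
    if (codeUnit > 127) // ASCII is 0-127
    {
        Console.WriteLine(codeUnit + " at index " + (i / 2) + " is not within the ASCII range");
    }
}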

This prints a list of all the numerical values that are outside the range of allowed ASCII values. (I am taking the original definition of ASCII as 0-127.)

Personally, I recommend David Arno's solution. I only post this as a potential alternative. (It may be faster, though you would have to benchmark it to find out, and it may also be easier to maintain.)

Update: I just tested it, and it still properly recognizes characters in the higher range (U+10000 - U+10FFFF) as not being allowed. This is because both halves of a surrogate pair are themselves outside the ASCII range. The only issue is that it reports such a character as two separate code units rather than as one character.

Output:

237 at index 13 is not within the ASCII range
225 at index 22 is not within the ASCII range
55378 at index 30 is not within the ASCII range
57186 at index 31 is not within the ASCII range
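If a surrogate pair should be reported once rather than twice, one option is to loop over the string's chars directly (each char in a .NET string is already a UTF-16 code unit, so the byte array is not needed) and combine pairs with char.IsSurrogatePair and char.ConvertToUtf32. This is only a sketch of that idea; the class and method names (NonAsciiScanner, Find) are hypothetical and not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; the names are illustrative only.
public static class NonAsciiScanner
{
    // Returns (index, code point) for every character outside the ASCII range 0-127.
    // The index is the position of the first UTF-16 code unit of the character.
    public static List<KeyValuePair<int, int>> Find(string text)
    {
        var found = new List<KeyValuePair<int, int>>();
        for (int i = 0; i < text.Length; i++)
        {
            int start = i;
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                // Combine the high and low surrogate into one code point.
                codePoint = char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate on the next iteration
            }
            else
            {
                // A single code unit (or an unpaired surrogate, reported as-is).
                codePoint = text[i];
            }

            if (codePoint > 127)
            {
                found.Add(new KeyValuePair<int, int>(start, codePoint));
            }
        }
        return found;
    }

    public static void Main()
    {
        const string input = "This is a stríng with áccents \uD852\uDF62.";
        foreach (var hit in Find(input))
        {
            Console.WriteLine(hit.Value + " at index " + hit.Key + " is not within the ASCII range");
        }
    }
}
```

With this input, the pair 55378/57186 at indices 30 and 31 should collapse into a single entry, code point 150370 (U+24B62) at index 30.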
