Extracting sub-arrays safely/idiomatically in C#

Question

I am building a natural language processor in C#, and many 'words' in our database are actually multiple-word phrases that refer to one noun or action. Please, no discussion of this design call; suffice it to say it is not changeable at this time. I have string arrays of related words (chunks) of the sentence that I need to test for these phrases and words. What is an appropriately idiomatic way to handle sub-array extraction so I run the least risk of out-of-range errors and the like?

To give an example of the desired logic, let me step through a run with a sample chunk. For our purposes, assume that the only multiple-word phrase from the database is 'quick brown'.

Full phrase: The quick brown fox -> encoded as {"The", "quick", "brown", "fox"}
First iteration: Test "The quick brown fox" -> returns nothing
Second iteration: Test "The quick brown" -> returns nothing
Third iteration: Test "The quick" -> returns nothing
Fourth iteration: Test "The" -> returns value
Fifth iteration: Test "quick brown fox" -> returns nothing
Sixth iteration: Test "quick brown" -> returns value
Seventh iteration: Test "fox" -> returns value

Sum all returned values and return.

I have some ideas of how to go about this, but the more I look at things the more worried I get about array addressing errors and other such horrors plaguing my code. The phrase is coming in as a string array, but I'm fine with converting it to an IEnumerable. My only concern there is an IEnumerable's lack of an indexer.

And why was the two-word phrase not encoded initially as {"The", "quick brown", "fox"}? – sll, Aug 15 '11 at 20:41
Are you asking about how to generate your test phrases for each iteration? – Philipp Schmid, Aug 15 '11 at 20:43
@sllev: Because we don't know it's a multi-word phrase until it gets to the database and comes back. Before that point it simply gets encoded based on related grammatical role (subject clause, verb clause, etc). It is from there broken into a string array of the words inside that clause. – tmesser, Aug 15 '11 at 20:43
@Philipp: Sort of? I know how to generate strings from the sub-arrays. The problem is making sure I get the sub-arrays in such a way that I don't blow something up while fuddling around with integer addressing or something. – tmesser, Aug 15 '11 at 20:45

Jim Mischel · Answer 1 · 2011-08-15T22:11:32.620

score 2

This sounds like a perfect application for the Aho-Corasick string matching algorithm. I have a dictionary of about 10 million phrases that I run short strings through. It's incredibly fast. With a single pass it will tell you all of the matching phrases. So if "the," "fox," and "quick brown" were all in the dictionary, a single pass would return all three indexes.

It's pretty easy to implement. Find the original paper online and you can build it in an afternoon.

Efficient String Matching: An Aid to Bibliographic Search
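
As a rough illustration of the idea (not the paper's code), here is a minimal sketch of Aho-Corasick that treats each whole word as one symbol, so it can run directly over the string array from the question. The names PhraseMatcher and FindAll, the case-insensitive comparison, and the (start index, word count) result shape are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration: Aho-Corasick over word tokens.
public sealed class PhraseMatcher
{
    private sealed class Node
    {
        // Case-insensitive matching is an assumption of this sketch.
        public readonly Dictionary<string, Node> Next =
            new Dictionary<string, Node>(StringComparer.OrdinalIgnoreCase);
        public Node Fail;
        // Word counts of every dictionary phrase that ends at this node.
        public readonly List<int> Lengths = new List<int>();
    }

    private readonly Node root = new Node();

    public PhraseMatcher(IEnumerable<string> phrases)
    {
        // Phase 1: build a trie keyed on whole words.
        foreach (string phrase in phrases)
        {
            string[] parts = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 0) continue;
            Node node = root;
            foreach (string word in parts)
            {
                Node child;
                if (!node.Next.TryGetValue(word, out child))
                {
                    child = new Node();
                    node.Next[word] = child;
                }
                node = child;
            }
            node.Lengths.Add(parts.Length);
        }

        // Phase 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<string, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                Node target;
                while (fail != root && !fail.Next.ContainsKey(edge.Key)) fail = fail.Fail;
                edge.Value.Fail = fail.Next.TryGetValue(edge.Key, out target) ? target : root;
                // Inherit shorter phrases that end at the same word.
                edge.Value.Lengths.AddRange(edge.Value.Fail.Lengths);
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Returns (start index, word count) for every dictionary phrase found in words.
    public List<KeyValuePair<int, int>> FindAll(IList<string> words)
    {
        var hits = new List<KeyValuePair<int, int>>();
        Node node = root;
        for (int i = 0; i < words.Count; i++)
        {
            Node next;
            while (node != root && !node.Next.ContainsKey(words[i])) node = node.Fail;
            node = node.Next.TryGetValue(words[i], out next) ? next : root;
            foreach (int length in node.Lengths)
                hits.Add(new KeyValuePair<int, int>(i - length + 1, length));
        }
        return hits;
    }
}
```

With a dictionary of "the", "fox", and "quick brown", FindAll over {"The", "quick", "brown", "fox"} reports three hits: one word at index 0, two words starting at index 1, and one word at index 3. If "quick" and "brown" were also in the dictionary, they would be reported as additional overlapping hits, so the caller would still have to decide which of the overlapping matches to keep (for example, preferring the longest).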

edited Aug 15 '11 at 22:11

answered Aug 15 '11 at 20:43

Jim Mischel


This is great (and I am looking up an implementation of it right now), but how precisely do I obtain these three strings out of the array I'm passed in, particularly if both "quick" and "brown" are also in the dictionary? – tmesser Aug 15 '11 at 20:52
Read the original paper, which will explain things much better than I can. And if you run across my published implementation, ignore it. I misunderstood the algorithm when I published that, and haven't had the opportunity to correct it. See my edited response for a link to the original paper. – Jim Mischel Aug 15 '11 at 22:10
Here's an implementation of the algorithm in C#: http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 – Jim Mischel Sep 15 '11 at 15:22

score 1 · Answer 2 · answered Aug 15 '11 at 20:40


Would ArraySegment or a DelimitedArray help?

answered Aug 15 '11 at 20:40

Mark Cidade


Experimenting with this now. My initial reaction is not the best, since the constructor is (baseArray, offset, count). If the count variable is off for whatever reason I would still bust the top of the array and get an exception, would I not? I would feel more comfortable if I could assign the starting and ending offset, but I'll keep playing with this to see if I can work with it better than the ugliness I had coded before. – tmesser Aug 15 '11 at 20:49
count = endIndex - startIndex – Mark Cidade Aug 15 '11 at 20:53
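
To make the start/end idea from these comments concrete, here is a small sketch of a wrapper that takes a half-open range [start, end) and converts it to the (offset, count) pair that ArraySegment expects. SegmentHelper and FromRange are hypothetical names, and silently clamping out-of-range values (rather than throwing) is an assumption of this example, not something ArraySegment does itself.

```csharp
using System;

public static class SegmentHelper
{
    // Hypothetical helper: builds a segment from the half-open range [start, end).
    // Both ends are clamped to the array, so the ArraySegment constructor
    // never sees a negative offset or a count that runs past the end.
    public static ArraySegment<T> FromRange<T>(T[] array, int start, int end)
    {
        if (array == null) throw new ArgumentNullException("array");
        int from = Math.Max(0, Math.Min(start, array.Length));
        int to = Math.Max(from, Math.Min(end, array.Length));
        return new ArraySegment<T>(array, from, to - from); // count = end - start
    }
}
```

For the sample chunk, SegmentHelper.FromRange(words, 1, 3) covers "quick" and "brown" (Offset 1, Count 2), and an end value past the array is trimmed to the last word. Whether clamping or throwing is the better behaviour depends on whether a bad range should be treated as a bug or as normal input.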

score 1 · Answer 3 · answered Aug 15 '11 at 20:48


How about something like this:

    string[] words = new string[] { "The", "quick", "brown", "fox" };

    for (int start = 0; start < words.Length; start++) // at least one word
    {
        // end is exclusive, so it may equal words.Length to include the last word
        for (int end = start + 1; end <= words.Length; end++)
        {
            ArraySegment<string> segment = new ArraySegment<string>(words, start, end - start);
            // test segment
        }
    }

This assumes you can use the ArraySegment segment for your test.

answered Aug 15 '11 at 20:48

Philipp Schmid


I initially did something like this (I hadn't used ArraySegment, but it was an embedded loop), and basically it works great until you hit a multi-word phrase, at which point you have to do some disgusting hacking with the start parameter to make sure it stays on target. It makes the code very hard to read and seriously damages my confidence that it will work in most situations beyond the unit tests I've thrown together. – tmesser Aug 15 '11 at 20:59
Can you give an example of a 'multi-word phrase'? What problems did you encounter? – Philipp Schmid Aug 15 '11 at 21:16
I was referring to something like 'brown fox' but it does not matter now, between your code sample and the DelimitedArray given by Mark I have a procedure I'm comfortable with. I made a few changes to the DelimitedArray to make it behave how I want. I'll be submitting an edit to your post to demonstrate what I did; please watch out for it. – tmesser Aug 15 '11 at 21:27

score 0 · Accepted Answer · answered Aug 16 '11 at 14:07

The path forward here lay in combining Mark's and Philipp's answers. Under ideal circumstances I would have edited one of their posts with it but it appears as though my edits were denied.

Anyway, I took the DelimitedArray that Mark linked and changed a few things in it:

Constructor changed to:

    public DelimitedArray(T[] array, int offset, int count, bool throwErrors = false)
    {
        this.array = array;
        this.offset = offset;
        this.count = count;
        this.throwErrors = throwErrors;
    }

Index reference changed to:

    public T this[int index]
    {
        get
        {
            int idx = this.offset + index;
            // Check the relative index against the segment and the absolute index against the backing array.
            if (index < 0 || index >= this.Count || idx < 0 || idx >= this.array.Length)
            {
                if (throwErrors)
                    throw new IndexOutOfRangeException("Index '" + index + "' was outside the bounds of the segment.");
                return default(T);
            }
            return this.array[idx];
        }
    }

I then worked that into Philipp's loop usage. This becomes:

        // Bounds let a segment start at any word and end at the last one (end is exclusive).
        for (var start = 0; start < words.Length; start++) // at least one word
        {
            for (var end = start + 1; end <= words.Length; end++)
            {
                var segment = new DelimitedArray<string>(words, start, end - start);
                // Join the words themselves; assumes DelimitedArray<T> implements IEnumerable<T>
                lemma = string.Join(" ", segment); // get the word/phrase to test
                result = this.DoTheTest(lemma);

                if (result > 0)
                {
                    // Add the new result
                    ret = ret + result;

                    // Move the start sentinel up, mindful of the +1 that will happen at the end of the loop
                    start = start + segment.Count - 1;
                    // And instantly finish the end sentinel; we're done here.
                    end = words.Length;
                }
            }
        }

If I could accept more than one answer I'd mark both of their answers but as both of them are incomplete I will have to accept my own when I am able to do so tomorrow.
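
For comparison, the walk-through in the question tests the longest candidate first and only then shortens it, whereas the loop above grows the candidate from one word upward. The following is a minimal sketch of the longest-first order using LINQ's Skip and Take, which sidesteps index arithmetic on the array entirely. ChunkScorer, Score, and the lookup delegate are hypothetical names; lookup stands in for the database test and is assumed to return 0 for an unknown word or phrase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkScorer
{
    // lookup is a hypothetical stand-in for the database test: it is assumed to
    // return 0 when the word or phrase is unknown and a positive value otherwise.
    public static int Score(IList<string> words, Func<string, int> lookup)
    {
        int total = 0;
        int start = 0;
        while (start < words.Count)
        {
            int consumed = 1; // move on by one word if nothing matches here
            for (int length = words.Count - start; length >= 1; length--)
            {
                // Skip/Take stop at the end of the sequence instead of throwing,
                // so a wrong length cannot run past the array.
                string candidate = string.Join(" ", words.Skip(start).Take(length));
                int value = lookup(candidate);
                if (value > 0)
                {
                    total += value;
                    consumed = length; // resume right after the matched phrase
                    break;
                }
            }
            start += consumed;
        }
        return total;
    }
}
```

For {"The", "quick", "brown", "fox"} this tests the same seven candidates, in the same order, as the walk-through in the question. The trade-off is that Skip and Take re-enumerate the sequence for every candidate, which is fine for short chunks but slower than an ArraySegment for long ones.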
