14

Given a generic List, I need some kind of index (in the database sense) that would allow fast retrieval. The keys for this index would not be unique, so I can't use a dictionary. Here's what I have in mind: given a class Foo { P1, P2, P3 } that may have data like this:

{ "aaa", 111, "yes" }
{ "aaa", 112, "no" }
{ "bbb", 111, "no" }
{ "bbb", 220, "yes" }
{ "bbb", 220, "no" }
{ "ccc", 300, "yes" }

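Here is a minimal sketch of what such a Foo class might look like (the property types are assumptions inferred from the sample data above):

public class Foo
{
    public string P1 { get; set; }
    public int P2 { get; set; }
    public string P3 { get; set; }
}
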
I would need to quickly access all the records where P1 is "bbb" (the 3rd, 4th, and 5th) or all the ones where P2 is 111 (the 1st and 3rd). I could use a sorted List, but if I need more than one way of sorting / indexing I would end up with duplicated lists.

Is there something built into the .NET framework, or maybe an OSS library, that would do something like this? Thanks.

P.S. I mentioned "sorted List" with the idea that a sorted list will return / find an item much faster. I do not need the list to be necessarily sorted; I'm just looking for fast retrieval / finding.

Patrick Karcher
pbz

8 Answers

15

Don't ever forget this principle: Make it correct, make it clear, make it concise, make it fast. In that order. So, first code up the naive implementation:

static IEnumerable<T> GetByIndex<T, TIndex>(
    List<T> list,
    Func<T, TIndex> func,
    TIndex key
) {
    // EqualityComparer is used because == isn't available on an unconstrained type parameter
    return list.Where(x => EqualityComparer<TIndex>.Default.Equals(func(x), key));
}

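The snippets here assume a small Test type along these lines (a sketch; the Valid enum in particular is an assumption):

enum Valid { Yes, No }

class Test {
    public string Name { get; set; }
    public int Value { get; set; }
    public Valid Valid { get; set; }
}
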
Usage:

List<Test> tests = new List<Test>() {
    new Test { Name = "aaa", Value = 111, Valid = Valid.Yes },
    new Test { Name = "aaa", Value = 111, Valid = Valid.Yes },
    new Test { Name = "bbb", Value = 112, Valid = Valid.No },
    new Test { Name = "bbb", Value = 111, Valid = Valid.No },
    new Test { Name = "bbb", Value = 220, Valid = Valid.No },
    new Test { Name = "ccc", Value = 220, Valid = Valid.Yes }
};
IEnumerable<Test> lookup = GetByIndex(tests, x => x.Name, "bbb");

The above is correct, clear and concise. Almost surely it is fast enough for your purposes.

So, as far as making it fast goes, you must first measure:

  1. Establish reasonable performance criterion.
  2. Establish a test-bed of real-world data.
  3. Profile the simple approach against the test-bed of real-world data. Note here that profiling includes deducing whether or not this functionality is a bottleneck in your application. A rough way to do that is sketched below.
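
As a rough illustration of step 3, a simple Stopwatch-based measurement of the naive lookup might look like this (the loop count of 1000 is an arbitrary placeholder; run it against your real data and access patterns):

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 1000; i++) {
    // ToList() forces the deferred Where to actually execute
    var matches = GetByIndex(tests, x => x.Name, "bbb").ToList();
}
sw.Stop();
Console.WriteLine("1000 lookups took {0} ms", sw.ElapsedMilliseconds);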

Then, if and only if this is not fast enough for you, should you try to optimize. It wouldn't be too hard to implement an IndexedList<T> : ICollection<T> that would allow you to index off of various properties.

Here is a naive implementation that could get you started:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq.Expressions;

class IndexedList<T> : IEnumerable<T> {
    List<T> _list;
    Dictionary<string, Dictionary<object, List<T>>> _dictionary;
    Dictionary<string, Func<T, object>> _propertyDictionary;

    public IndexedList(IEnumerable<string> propertyNames) : this(propertyNames, new List<T>()) { }

    public IndexedList(IEnumerable<string> propertyNames, IEnumerable<T> source) {
        _list = new List<T>();
        _dictionary = new Dictionary<string, Dictionary<object, List<T>>>();
        _propertyDictionary = BuildPropertyDictionary(propertyNames);
        foreach (var item in source) {
            Add(item);
        }
    }

    static Dictionary<string, Func<T, object>> BuildPropertyDictionary(IEnumerable<string> keys) {
        var propertyDictionary = new Dictionary<string,Func<T,object>>();
        foreach (string key in keys) {
            ParameterExpression parameter = Expression.Parameter(typeof(T), "parameter");
            Expression property = Expression.Property(parameter, key);
            Expression converted = Expression.Convert(property, typeof(object));
            Func<T, object> func = Expression.Lambda<Func<T, object>>(converted, parameter).Compile();
            propertyDictionary.Add(key, func);
        }
        return propertyDictionary;
    }

    public void Add(T item) {
        _list.Add(item);
        foreach (var kvp in _propertyDictionary) {
            object key = kvp.Value(item);
            Dictionary<object, List<T>> propertyIndex;
            if (!_dictionary.TryGetValue(kvp.Key, out propertyIndex)) {
                propertyIndex = new Dictionary<object, List<T>>();
                _dictionary.Add(kvp.Key, propertyIndex);
            }
            List<T> list;
            if (!propertyIndex.TryGetValue(key, out list)) {
                list = new List<T>();
                propertyIndex.Add(key, list);
            }
            list.Add(item);
        }
    }

    public IEnumerable<T> GetByIndex<TIndex>(string propertyName, TIndex index) {
        return _dictionary[propertyName][index];
    }

    public IEnumerator<T> GetEnumerator() {
        return _list.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() {
        return GetEnumerator();
    }
}

Usage:

List<Test> tests = new List<Test>() {
    new Test { Name = "aaa", Value = 111, Valid = Valid.Yes },
    new Test { Name = "aaa", Value = 111, Valid = Valid.Yes },
    new Test { Name = "bbb", Value = 112, Valid = Valid.No },
    new Test { Name = "bbb", Value = 111, Valid = Valid.No },
    new Test { Name = "bbb", Value = 220, Valid = Valid.No },
    new Test { Name = "ccc", Value = 220, Valid = Valid.Yes }
};
// build an IndexedList<Test> indexed by Name and Value
IndexedList<Test> indexed = new IndexedList<Test>(new List<string>() { "Name", "Value" }, tests);
// lookup where Name == "bbb"
foreach (var result in indexed.GetByIndex("Name", "bbb")) {
    Console.WriteLine(result.Value);
}

But note: the reason you don't do this unless the naive implementation turns out to be too slow is the additional complexity you just added to your system. It is new code to maintain and new code to test, and you might gain nothing if it isn't faster on your real-world data or if this lookup isn't a bottleneck in your application.

jason
12

(Edited to elaborate on collection-based strategy)

There is no intrinsic structure in .NET for looking up using various indexes. Here are two good strategies:

Option 1: LINQ, for flexibility and simplicity
For simplicity and a lot of other integrated options, create a List (or something else that implements IEnumerable) of custom types and use LINQ to do your on-demand lookups. Note that you could use anonymous types if that's convenient for you. You can also have your data in an XML structure and still do all this. You'll likely be able to get your data, do your lookups, and manipulate the results in a small amount of clear code. In .NET 4.0 you can use Parallel LINQ (PLINQ) to effortlessly have this process take advantage of multi-core processing.

List<Foo> bigFooList = new List<Foo>
{
     new Foo { P1 = "aaa", P2 = 111, P3 = "yes" },
     new Foo { P1 = "aaa", P2 = 112, P3 = "no"  },
     new Foo { P1 = "bbb", P2 = 111, P3 = "no"  },
     new Foo { P1 = "bbb", P2 = 220, P3 = "yes" },
     new Foo { P1 = "bbb", P2 = 220, P3 = "no"  },
     new Foo { P1 = "ccc", P2 = 300, P3 = "yes" }
};
var smallFooList = from f in bigFooList where f.P2 == 220 select f;

Option 2: Multiple collections, for indexed look-up power.
If you're doing a lot of lookups on a large set and need power, you can use multiple collections to achieve faster lookups. The tricky part is your requirement that the index values can be duplicated. Here are some strategies:

  • Check out the Lookup class. Create your List. Then for each field for which you want an indexed lookup, create a Lookup object. They cannot be constructed directly; they are derived from your IEnumerable collection:
    Lookup<string, Foo> LookupP1 = (Lookup<string, Foo>) fooList.ToLookup(f => f.P1, f => f);
    See the link for syntax for retrieving your items. Basically LookupP1 contains IGrouping objects for each unique value of P1, keyed on that P1 value. You iterate through that object to get your matching items. A key attribute of Lookup objects is that they are immutable; so each time you add to or remove from your fooList, you'll have to rebuild all your Lookup objects. But if you seldom alter your fooList, this is the way to go.
  • Create a Dictionary<T, List<Foo>> for each field upon which you will need to search by index, where T is the type of that field's value. So for your example we would create:
    var FoosByP1 = new Dictionary<string, List<Foo>>();
    var FoosByP2 = new Dictionary<int, List<Foo>>(); etc.
    Then add to FoosByP1, keyed on each unique P1 value, a List containing all the Foo items where P1 has that value (e.g. keyed by "aaa", a List containing all Foo objects for which P1 is "aaa"). Repeat for each Foo field. Based on your data, FoosByP1 would contain 3 List objects, containing 2, 3 and 1 Foo items respectively. With this scheme you can then retrieve very quickly. (A dictionary is basically a hash table.)
    The main catch is that your data would be duplicated in each of these dictionaries, which may or may not be a problem. If Foo has 20 fields and you have many Foo items, you can save memory by having a central dictionary with a numeric key holding all your Foo items; the individual indexed dictionaries would then be Dictionary<T, List<int>>, where the int is the key of a Foo item in your central dictionary. This would save memory and still be quite fast.
    Whether you have a central dictionary or not, building your dictionaries will take some CPU cycles, but once you have them you'll be in great shape. And use LINQ to build your dictionaries! A sketch of both approaches follows this list.
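
Here's a rough sketch of both strategies, assuming the Foo class from the question and a populated List<Foo> named fooList:

// Strategy 1: an ILookup, built once from the list (rebuild it whenever fooList changes).
ILookup<string, Foo> LookupP1 = fooList.ToLookup(f => f.P1);
foreach (Foo f in LookupP1["bbb"])
{
    Console.WriteLine("{0} {1} {2}", f.P1, f.P2, f.P3);
}

// Strategy 2: one dictionary of lists per indexed field, built with GroupBy/ToDictionary.
Dictionary<string, List<Foo>> FoosByP1 = fooList
    .GroupBy(f => f.P1)
    .ToDictionary(g => g.Key, g => g.ToList());
Dictionary<int, List<Foo>> FoosByP2 = fooList
    .GroupBy(f => f.P2)
    .ToDictionary(g => g.Key, g => g.ToList());

List<Foo> foosNamedBbb = FoosByP1["bbb"];   // the 3rd, 4th and 5th items
List<Foo> foosValued111 = FoosByP2[111];    // the 1st and 3rd items
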
Patrick Karcher
  • I don't need them to be sorted per se, I just need fast access to these subsets. – pbz Jan 27 '10 at 00:43
  • How's that different from just looping through the list with a foreach? As far as I know that will end up being a loop in the end, i.e. no use of any index... – pbz Jan 27 '10 at 01:00
  • Your Dictionary<T, List<Foo>> is what I had in mind. In my particular case i4o turned out to be sufficient, but this may help someone else in the future. Thanks. – pbz Jan 28 '10 at 00:25
3

I've never actually had a chance to use it, but you may try i4o. It's supposed to provide indexes for in-memory objects for use with LINQ. You specify the indexes for a class using either attributes or as part of constructing the indexer, then you create an IndexableCollection.

At that point, you just query the collection using LINQ, and the indexes work behind the scenes to optimize the access patterns for the data.

Chris Pitman
  • The idea behind i4o is very neat and I think it should be built into the framework. Unfortunately, as it is right now it's limited to a simple single where condition (i.e. only where something="value", no && or ||). For my case it was sufficient though. Thanks. – pbz Jan 28 '10 at 00:24
  • The link to i4o is leading to Microsoft.... But this is in github: https://github.com/ericksoa/i4o (and ~ 12 years old),.... – Luuk Jun 25 '22 at 16:12
2

One route would be to just use an embedded relational database à la SQLite (there's an ADO.NET binding here: http://sqlite.phxsoftware.com/).

Most data structures aren't going to meet your requirements unless you're willing to re-sort the list (or whatever structure you use) each time you need a different ordering.
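
A rough sketch of what that could look like with the System.Data.SQLite binding linked above: an in-memory database with ordinary SQL indexes on P1 and P2 (the table and column names are just placeholders).

using System.Data.SQLite;

using (var conn = new SQLiteConnection("Data Source=:memory:"))
{
    conn.Open();

    // one table for Foo, with an index per column you want to search on
    new SQLiteCommand(
        "CREATE TABLE Foo (P1 TEXT, P2 INTEGER, P3 TEXT);" +
        "CREATE INDEX IX_Foo_P1 ON Foo (P1);" +
        "CREATE INDEX IX_Foo_P2 ON Foo (P2);", conn).ExecuteNonQuery();

    using (var insert = new SQLiteCommand(
        "INSERT INTO Foo (P1, P2, P3) VALUES (@p1, @p2, @p3)", conn))
    {
        insert.Parameters.AddWithValue("@p1", "bbb");
        insert.Parameters.AddWithValue("@p2", 220);
        insert.Parameters.AddWithValue("@p3", "yes");
        insert.ExecuteNonQuery();
    }

    // indexed lookup on P1
    using (var select = new SQLiteCommand(
        "SELECT P1, P2, P3 FROM Foo WHERE P1 = @p1", conn))
    {
        select.Parameters.AddWithValue("@p1", "bbb");
        using (var reader = select.ExecuteReader())
        {
            while (reader.Read())
            {
                Console.WriteLine("{0} {1} {2}",
                    reader.GetString(0), reader.GetInt32(1), reader.GetString(2));
            }
        }
    }
}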

Joe
1

You might want to consider something like Lucene.Net, an indexing and search library. It may be a more complex solution than you were looking for, but it would definitely meet your performance needs.
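
To give a feel for what's involved, here is a minimal sketch against the Lucene.Net 3.x API (an in-memory index with one document; the field names follow the question's Foo properties and are otherwise placeholders):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

var directory = new RAMDirectory();
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

// index one document per Foo
using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("P1", "bbb", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("P2", "220", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("P3", "yes", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.AddDocument(doc);
}

// search for P1 == "bbb"
using (var searcher = new IndexSearcher(directory, true))
{
    TopDocs hits = searcher.Search(new TermQuery(new Term("P1", "bbb")), 10);
    foreach (ScoreDoc hit in hits.ScoreDocs)
    {
        Document found = searcher.Doc(hit.Doc);
        Console.WriteLine("{0} {1}", found.Get("P1"), found.Get("P2"));
    }
}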

jamesaharvey
0

I know you said you couldn't use a Dictionary, but would the following work?

For your example data set:

{ "aaa", 111, "yes" }
{ "aaa", 112, "no"  }
{ "bbb", 111, "no"  }
{ "bbb", 220, "yes" }
{ "bbb", 220, "no"  }
{ "ccc", 300, "yes" }

You could use the following:

var p1Lookup = new Dictionary<string, int[]>();
p1Lookup.Add( "aaa", new int [] {0, 1} );
p1Lookup.Add( "bbb", new int [] {2, 3, 4} );
p1Lookup.Add( "ccc", new int [] {5} );

var p2Lookup = new Dictionary<int, int[]>();
p2Lookup.Add( 111, new int [] {0, 2} );
p2Lookup.Add( 112, new int [] {1} );
p2Lookup.Add( 220, new int [] {3, 4} );
p2Lookup.Add( 300, new int [] {5} );

var p3Lookup = new Dictionary<string, int[]>();
p3Lookup.Add( "yes", new int [] {0, 3, 5} );
p3Lookup.Add( "no",  new int [] {1, 2, 4} );

Depending on the usage, you could build the look-up dictionaries just once and reuse them for every query.
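
Retrieval is then a dictionary lookup plus indexing into the original list. For example, assuming the Foo items live in a List<Foo> named fooList in the order shown above:

// every Foo whose P1 == "bbb"
foreach (int i in p1Lookup["bbb"])
{
    Foo match = fooList[i];
    Console.WriteLine("{0} {1} {2}", match.P1, match.P2, match.P3);
}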

Joseph Gordon
0

If you only need to iterate the list once to build the index, but search it many times and change it very little (the situation DB indexes are best at), a dictionary would be very fast once built. My method doesn't duplicate the items; it only stores their positions in the list.

// Map each value of the indexed property to the positions of the matching items in pList.
// (toIndexBy stands in for whichever property you want to index on.)
var indexDict = new Dictionary<string, List<int>>();

for (int ct = 0; ct < pList.Count; ct++)
{
    var item = pList[ct];

    if (!indexDict.ContainsKey(item.toIndexBy))
    {
        indexDict.Add(item.toIndexBy, new List<int> { ct });
    }
    else
    {
        indexDict[item.toIndexBy].Add(ct);
    }
}

Now you have a super fast lookup for the indexes.

So if you want "bbb"'s indexes you could do:

List<int> bbbIndexes = indexDict["bbb"];

Timothy Gonzalez
-2

Why not use a HashSet to store the different instances of the Foo object (which will be unique) and then use a LINQ query to retrieve the ones that match the given criteria?

Something like:

var hash = new HashSet<Foo>
{
    new Foo { P1 = "aaa", P2 = 111, P3 = "yes" },
    new Foo { P1 = "aaa", P2 = 112, P3 = "no"  },
    new Foo { P1 = "bbb", P2 = 111, P3 = "no"  },
    new Foo { P1 = "bbb", P2 = 220, P3 = "yes" },
    new Foo { P1 = "bbb", P2 = 220, P3 = "no"  },
    new Foo { P1 = "ccc", P2 = 300, P3 = "yes" },
};

var results = from match in hash
              where match.P1 == "aaa"
              select match;
Brad Cunningham
  • Forgot about the sorting need. You could add an order by clause to the LINQ query to handle sorting the resulting list (which is smarter than sorting the whole list first and then filtering, in most cases) – Brad Cunningham Jan 27 '10 at 00:51
  • How would it know that P1 is indexed? Wouldn't it be just as slow as a foreach? Thanks. – pbz Jan 27 '10 at 01:06
  • -1: This answer solves nothing, it's just like an array, unsorted at that, with extra overhead. Also note that he doesn't say he wants just one row for 111, he wants them all, fast. The above solution, given that none of the objects are actually duplicates, would store them all, and the Linq query would iterate over them all, as with a simple array. The real solution is to first figure out how far you need to go, and then if needed, implement an in-memory database-like structure with multiple indices. – Lasse V. Karlsen Jan 27 '10 at 14:58