I have this class (simplified for readability):

public class Customer
{
    public string Id {get;set;}
    public Email[] Emails {get;set;}
}

From an external system I get a list of Customers that can contain multiple entries for the same Customer (Id).

Raw input JSON

[
{"id": "a1", "emails": ["a", "b", "c"]},
{"id": "a1", "emails": ["d"]},
{"id": "b3", "emails": ["e", "f"]},
{"id": "k77", "emails": ["z", "a"]}
]

C# code to fetch the Customers

List<Customer> dataInput = CallToExternalService(...);

I want to generate a unique list of Customers via LINQ, where each Customer carries the merged list of all of that customer's emails. I know how to get a list of unique customers:

dataInput.GroupBy(x => x.Id).Select(x => x.First()).ToList();

But I'm struggling with how to merge the email lists into one for each customer. Performance is also an important factor, since the data will contain 10k+ items and the job needs to run every hour.

I have tried a lot. Select and SelectMany look like good candidates, but I can't wrap my head around how to merge the lists, let alone how to get the merged list back onto the x.First() item.

Pseudocode (ForEachGroup and MergeLists are made-up helpers, shown only to express the intent):

dataInput
    .GroupBy(x => x.Id)
    .ForEachGroup(g => g.First().Emails = MergeLists(g.Select(c => c.Emails)))
    .Select(g => g.First())
    .ToList();

Expected end result (a C# List<Customer>):

id: a1, emails:[a,b,c,d]
id: b3, emails:[e,f]
id: k77, emails:[z,a]
David
  • It's better to use `.SelectMany(x => x.Take(1))` than `.Select(x => x.First())` - primarily it's more robust when you modify the query. – Enigmativity Nov 01 '21 at 21:05
  • So just to clarify: your input could have multiple items with the same `Id` value, and it can include Emails with similar values, and you want to end up with a list of objects where the `Id` values are distinct, and the `Email` collection in each entry contains only distinct email values that were associated with its `Id`? – StriplingWarrior Nov 01 '21 at 21:06
  • When you say "can contain multiple lines" do you mean "can contain multiple emails"? – Enigmativity Nov 01 '21 at 21:06
  • @StriplingWarrior Sorry for the confusion. Yes and no. Multiple items with the same ID, yes. It's not necessary to do a duplicate check on Emails. – David Nov 01 '21 at 21:09
  • @Enigmativity Sorry for the confusion. There will only be duplicates of IDs, not Emails. – David Nov 01 '21 at 21:10
  • One email per line? – Jack A. Nov 01 '21 at 21:12
  • @JackA. One line could contain several Emails, but mostly just one. Will add an example in the question. – David Nov 01 '21 at 21:17
  • @David - Please add the example as valid c# code. – Enigmativity Nov 01 '21 at 21:18
  • @David - Please either include the code to deserialize the JSON or post the sample as valid C# code. – Enigmativity Nov 01 '21 at 21:31
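
One plausible reading of the `Take(1)` versus `First()` comment above is sketched below: once a filter is added inside the group, `First()` throws on a group with no matches, while `Take(1)` simply yields nothing for it. The tuple data and the class and method names are hypothetical and are not taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstVsTakeDemo
{
    // Hypothetical sample rows (Id, Email); not the question's Customer type.
    public static readonly (string Id, string Email)[] Rows =
    {
        ("a1", "a"), ("a1", "d"), ("b3", "e"),
    };

    // Take(1) yields zero items for an empty sequence, so group "b3" is skipped.
    public static List<string> WithTake() =>
        Rows.GroupBy(r => r.Id)
            .SelectMany(g => g.Where(r => r.Email != "e").Take(1))
            .Select(r => r.Id)
            .ToList();

    // First() throws InvalidOperationException once the filter empties group "b3".
    public static List<string> WithFirst() =>
        Rows.GroupBy(r => r.Id)
            .Select(g => g.Where(r => r.Email != "e").First())
            .Select(r => r.Id)
            .ToList();

    public static void Main()
    {
        Console.WriteLine(string.Join(",", WithTake())); // prints "a1"
        try
        {
            WithFirst();
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("First() threw on the empty group");
        }
    }
}
```

With an unfiltered `GroupBy`, groups are never empty, so both forms behave the same; the difference only shows up after the query is modified.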

3 Answers

Making some assumptions about what you mean by "merge", but does this look right?

dataInput
    .GroupBy(x => x.Id)
    .Select(g=> new Customer
        {
            Id = g.Key,
            Emails = g.SelectMany(c => c.Emails).ToArray()
        })
    .ToList();
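
To try the query above end to end, here is a minimal sketch using the question's sample data. It assumes emails are plain strings (the question's `Email` type is not shown), and `MergeDemo`, `Merge`, and `Sample` are names made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumption: Emails is string[] here; the question uses an Email type it does not show.
public class Customer
{
    public string Id { get; set; }
    public string[] Emails { get; set; }
}

public static class MergeDemo
{
    // One new Customer per distinct Id, with the emails of every duplicate concatenated.
    public static List<Customer> Merge(IEnumerable<Customer> input) =>
        input
            .GroupBy(c => c.Id)
            .Select(g => new Customer
            {
                Id = g.Key,
                Emails = g.SelectMany(c => c.Emails).ToArray()
            })
            .ToList();

    // The sample rows from the question.
    public static List<Customer> Sample() => new List<Customer>
    {
        new Customer { Id = "a1", Emails = new[] { "a", "b", "c" } },
        new Customer { Id = "a1", Emails = new[] { "d" } },
        new Customer { Id = "b3", Emails = new[] { "e", "f" } },
        new Customer { Id = "k77", Emails = new[] { "z", "a" } },
    };

    public static void Main()
    {
        foreach (var c in Merge(Sample()))
            Console.WriteLine($"{c.Id}: [{string.Join(",", c.Emails)}]");
        // a1: [a,b,c,d]
        // b3: [e,f]
        // k77: [z,a]
    }
}
```

`GroupBy` yields groups in order of first key appearance and keeps elements in source order within each group, which is why the merged emails come out as a, b, c, d.
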
StriplingWarrior
  • Nice! Thank you. Is there a way to avoid creating a new Customer? The Customer class has many properties, and it would be sweet if I didn't have to copy each one. – David Nov 01 '21 at 21:13
  • @David If the `CallToExternalService` is returning a list of customers, and that list has duplicates, and each `Customer` instance in the list has lots of properties, are all the other properties besides the email identical? – Jack A. Nov 01 '21 at 21:17
  • @David - You did say you wanted to "generate a unique list of Customers". You need to be clearer in what you ask for. – Enigmativity Nov 01 '21 at 21:17
  • @JackA. Yes, all properties are the same except for Emails – David Nov 01 '21 at 21:25
  • @StriplingWarrior: Big thanks and an upvote! – David Nov 01 '21 at 22:20
  • @David: Without knowing more about your use case, it's hard for me to give specifics, but it sounds like you've got some leaky abstractions. I would bet that with a critical look at your code you could probably find a better model than what you're using. For example, maybe each group could return an object with the first Customer as one property and the merged emails as another property. You could also consider making Customer a record and return `g.First() with {Emails = g.SelectMany(c => c.Emails).ToArray()}`. – StriplingWarrior Nov 03 '21 at 19:16
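
The record suggestion in the last comment could look roughly like the sketch below. It requires C# 9 or later and assumes `Customer` can be redefined as a record; `Name` is a hypothetical extra property added only to show that `with` copies everything except `Emails`, and emails are assumed to be plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumption: Customer redefined as a positional record; Name is a hypothetical extra property.
public record Customer(string Id, string Name, string[] Emails);

public static class RecordMergeDemo
{
    // "with" creates a copy of the first customer in each group and replaces only Emails.
    public static List<Customer> Merge(IEnumerable<Customer> input) =>
        input
            .GroupBy(c => c.Id)
            .Select(g => g.First() with { Emails = g.SelectMany(c => c.Emails).ToArray() })
            .ToList();

    public static void Main()
    {
        var input = new List<Customer>
        {
            new Customer("a1", "Alice", new[] { "a", "b", "c" }),
            new Customer("a1", "Alice", new[] { "d" }),
            new Customer("b3", "Bob", new[] { "e", "f" }),
        };

        var merged = Merge(input);
        Console.WriteLine($"{merged[0].Name}: {merged[0].Emails.Length} emails"); // Alice: 4 emails
        Console.WriteLine($"source still has {input[0].Emails.Length} emails");   // source still has 3 emails
    }
}
```

Because `with` produces a new instance, the objects in the input list are left as they were.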

If you're sure that all the other properties are the same, you can use First like in your initial attempt and modify @StriplingWarrior's answer like this:

dataInput
    .GroupBy(x => x.Id)
    .Select(g => 
    {
        var customer = g.First();
        customer.Emails = g.SelectMany(c => c.Emails).ToArray();
        return customer;
    })
    .ToList();
Jack A.
  • That's messy. Manipulating values ***while*** iterating is generally a bad thing. – Enigmativity Nov 01 '21 at 21:38
  • @Enigmativity while that's true in some cases, this does work. – Jack A. Nov 01 '21 at 21:43
  • Tried it and it works like a charm! Thank you and @StriplingWarrior. I read Enigmativity's critique, but this is the easiest to read and, in my opinion, the easiest to maintain. – David Nov 01 '21 at 22:20
  • @David, just for your further edification, the concept that Enigmativity was attempting to express is that it is recommended that LINQ queries do not have side effects. Here is a thread discussing this topic: https://stackoverflow.com/questions/6386184/c-sharp-paradigms-side-effects-on-lists. Recommendations and best practices exist for good reasons, but there are times when you may not want to follow them. Just be sure you have a good understanding of what you're trading off before you do. – Jack A. Nov 01 '21 at 23:18
  • As an example of what Jack's describing, suppose some day you add caching to the data layer where you get your `Customer`s. Altering those Customer objects would then cause other code paths that consume the original Customer objects to see the altered, merged customer object rather than the one that actually came out of your data store. – StriplingWarrior Nov 03 '21 at 19:20
  • @David FYI, note StriplingWarrior's comment above. This also speaks to the advantages of using immutability in your code. If your concern about re-creating the `Customer` class is the maintenance required, you could consider using something like AutoMapper. – Jack A. Nov 04 '21 at 12:51
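
The side effect discussed in these comments can be made visible with a small sketch. It assumes emails are plain strings; `SideEffectDemo`, `LoadSample`, and `MergeInPlace` are hypothetical names, with `LoadSample` standing in for a list that other code might also hold on to.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumption: Emails is string[] here; the question uses an Email type it does not show.
public class Customer
{
    public string Id { get; set; }
    public string[] Emails { get; set; }
}

public static class SideEffectDemo
{
    // Stand-in for data that other code paths may also reference (for example, a cache).
    public static List<Customer> LoadSample() => new List<Customer>
    {
        new Customer { Id = "a1", Emails = new[] { "a", "b", "c" } },
        new Customer { Id = "a1", Emails = new[] { "d" } },
        new Customer { Id = "b3", Emails = new[] { "e", "f" } },
    };

    // The mutate-the-first-item approach from the answer above.
    public static List<Customer> MergeInPlace(List<Customer> source) =>
        source
            .GroupBy(c => c.Id)
            .Select(g =>
            {
                var customer = g.First();
                customer.Emails = g.SelectMany(c => c.Emails).ToArray();
                return customer;
            })
            .ToList();

    public static void Main()
    {
        var source = LoadSample();
        var merged = MergeInPlace(source);

        // The merged item is the very same object as the first source item.
        Console.WriteLine(ReferenceEquals(source[0], merged[0])); // True
        // Anyone still reading the source list now sees the merged emails.
        Console.WriteLine(source[0].Emails.Length); // 4
    }
}
```

Whether this matters depends on who else holds references to the source objects; if the list is discarded right after merging, the mutation is harmless.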

If you're going to use Jack's approach, I'd suggest something slightly more robust.

var intermediate =
(
    from g in dataInput.GroupBy(c => c.Id)
    from c in g.Take(1)
    select new
    {
        customer = c,
        emails = g.SelectMany(d => d.Emails).ToArray()
    }
)
.ToArray();
    
foreach (var x in intermediate)
{
    x.customer.Emails = x.emails;
}

Customer[] output =
    intermediate
        .Select(x => x.customer)
        .ToArray();
Enigmativity
  • This does nothing. You need to assign the result of the query to a variable before you run it through `foreach`. Also, I recommend you test it, because it doesn't work as written. – Jack A. Nov 01 '21 at 21:49
  • @JackA. - Ah, yes, fair enough. It's just manipulating some of the original elements. – Enigmativity Nov 01 '21 at 21:54
  • @JackA. - Fixed, but it is now a fair bit of code. At least it's robust. – Enigmativity Nov 01 '21 at 21:57
  • The problem of manipulating _while iterating_ is only a subset of the general problem of manipulating the source objects _at all_. I'd personally recommend creating a new object that copies all of the necessary data from the first customer, rather than mutating the customer objects. – StriplingWarrior Nov 03 '21 at 19:30
  • @StriplingWarrior - Oh, absolutely! Creating read-only objects is the way to go. – Enigmativity Nov 03 '21 at 20:58