Assign unique value to field in duplicate records group during groupingBy

Question

Following the reply provided by devReddit here, I grouped the duplicate CSV records (same client name, mother, and birth date) of the following test file (fake data):

CSV test file

id,name,mother,birth,center
1,Antonio Carlos da Silva,Ana da Silva, 2008/03/31,1
2,Carlos Roberto de Souza,Amália Maria de Souza,2004/12/10,1
3,Pedro de Albuquerque,Maria de Albuquerque,2006/04/03,2
4,Danilo da Silva Cardoso,Sônia de Paula Cardoso,2002/08/10,3
5,Ralfo dos Santos Filho,Helena dos Santos,2012/02/21,4
6,Pedro de Albuquerque,Maria de Albuquerque,2006/04/03,2
7,Antonio Carlos da Silva,Ana da Silva, 2008/03/31,1
8,Ralfo dos Santos Filho,Helena dos Santos,2012/02/21,4
9,Rosana Pereira de Campos,Ivana Maria de Campos,2002/07/16,3
10,Paula Cristina de Abreu,Cristina Pereira de Abreu,2014/10/25,2
11,Pedro de Albuquerque,Maria de Albuquerque,2006/04/03,2
12,Ralfo dos Santos Filho,Helena dos Santos,2012/02/21,4

Client Entity

package entities;

public class Client {

    private String id;
    private String name;
    private String mother;
    private String birth;
    private String center;
    
    public Client() {
    }

    public Client(String id, String name, String mother, String birth, String center) {
        this.id = id;
        this.name = name;
        this.mother = mother;
        this.birth = birth;
        this.center = center;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getMother() {
        return mother;
    }

    public void setMother(String mother) {
        this.mother = mother;
    }

    public String getBirth() {
        return birth;
    }

    public void setBirth(String birth) {
        this.birth = birth;
    }

    public String getCenter() {
        return center;
    }

    public void setCenter(String center) {
        this.center = center;
    }
        
    @Override
    public String toString() {
        return "Client [id=" + id + ", name=" + name + ", mother=" + mother + ", birth=" + birth + ", center=" + center
                + "]";
    }
        
}

Program

package application;
    
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
    
import entities.Client;
    
public class Program {
    
    public static void main(String[] args) throws IOException {
            
        Pattern pattern = Pattern.compile(",");
            
        List<Client> file = Files.lines(Paths.get("src/Client.csv"))  
            .skip(1)
            .map(line -> { 
                String[] fields = pattern.split(line);
                return new Client(fields[0], fields[1], fields[2], fields[3], fields[4]);
            })
            .collect(Collectors.toList()); 
                        
        Map<String, List<Client>> grouped = file
            .stream()
            .filter(x -> file.stream().anyMatch(y -> isDuplicate(x, y)))
            .collect(Collectors.toList())
            .stream()
            .collect(Collectors.groupingBy(p -> p.getCenter(), LinkedHashMap::new, Collectors.mapping(Function.identity(), Collectors.toList())));

        grouped.entrySet().forEach(System.out::println);    
    }
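
    // Two records are duplicates when name, mother and birth match but the ids differ.
    // The method must live inside the Program class to compile.
    private static boolean isDuplicate(Client x, Client y) {
        return !x.getId().equals(y.getId())
            && x.getName().equals(y.getName())
            && x.getMother().equals(y.getMother())
            && x.getBirth().equals(y.getBirth());
    }
}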

Final Result (Grouped by Center)

1=[Client [id=1, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1],
    Client [id=7, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1]]
2=[Client [id=3, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
    Client [id=5, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
    Client [id=6, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
    Client [id=8, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
    Client [id=11, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
    Client [id=12, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2]]

What I Need

I need to assign a number to each group of repeated records, restarting whenever the center value changes. The records of each group must also stay together in the output, which a plain Map does not guarantee. The example below shows the desired result:

The numbers on the left show the grouping by center (1 and 2). Records with the same name share an inner group number, starting from "1". When the center changes, the inner group numbering restarts from "1", and so on.

    1=[Client [group=1, id=1, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1],
       Client [group=1, id=7, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1]]

    // CENTER CHANGED (2) - restart the inner group number at "1".

    2=[Client [group=1, id=3, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
       Client [group=1, id=6, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
       Client [group=1, id=11, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],

    // NAME CHANGED, SAME CENTER - the inner group number increases by 1 (group=2).

       Client [group=2, id=5, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
       Client [group=2, id=8, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
       Client [group=2, id=12, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2]]

panagiotis · Answer 1 · 2021-08-09T07:08:03.000

If I understood correctly, you need to sort the already grouped entries by all three properties: name, mother, and birth. You can apply that ordering before collecting with groupingBy, using sorted (this needs an import of java.util.Comparator):

 Map<String, List<Client>> grouped = file.stream()
                    .filter(x -> file.stream().anyMatch(y -> isDuplicate(x, y)))
                    .sorted(Comparator.comparing(Client::getName)
                                      .thenComparing(Client::getMother)
                                      .thenComparing(Client::getBirth))
                    .collect(Collectors.groupingBy(Client::getCenter));

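Collectors.groupingBy uses Collectors.toList() as its default downstream collector, so each list keeps the encounter order that you've already defined with sorted. A LinkedHashMap is only needed if the order of the center keys themselves matters, because the default result map makes no guarantee about key order.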
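A minimal sketch of that ordering behaviour. The Row class, the sample data and the use of TreeMap as map factory are assumptions made for illustration (they are not part of the question's code); TreeMap is chosen only so that the center keys also come out sorted:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SortedGroupingSketch {

    // Hypothetical, trimmed-down stand-in for Client (only the fields used here).
    static final class Row {
        private final String id;
        private final String name;
        private final String center;

        Row(String id, String name, String center) {
            this.id = id;
            this.name = name;
            this.center = center;
        }

        String id() { return id; }
        String name() { return name; }
        String center() { return center; }
    }

    // Made-up rows, deliberately out of order.
    static List<Row> sample() {
        return Arrays.asList(
                new Row("5", "Ralfo", "2"),
                new Row("3", "Pedro", "2"),
                new Row("1", "Antonio", "1"),
                new Row("8", "Ralfo", "2"),
                new Row("6", "Pedro", "2"),
                new Row("7", "Antonio", "1"));
    }

    // Sorts by name first, then groups by center. The lists keep the sorted
    // encounter order; the TreeMap keeps the center keys in ascending order.
    static Map<String, List<String>> idsByCenter(List<Row> rows) {
        return rows.stream()
                .sorted(Comparator.comparing(Row::name))
                .collect(Collectors.groupingBy(
                        Row::center,
                        TreeMap::new,
                        Collectors.mapping(Row::id, Collectors.toList())));
    }

    public static void main(String[] args) {
        System.out.println(idsByCenter(sample())); // prints {1=[1, 7], 2=[3, 6, 5, 8]}
    }
}
```

Because sorted is stable on an ordered stream, rows with the same name keep their original relative order (3 before 6, 5 before 8).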

Update: The groupId can be generated by the Client entity itself. Below is the updated Client:

package com.example.demo;

import java.util.Optional;

public class Client {

    private String id;
    private String name;
    private String mother;
    private String birth;
    private String center;
    private String groupId;

    public Client() {
    }

    public Client(String id, String name, String mother, String birth, String center) {
        this.id = id;
        this.name = name;
        this.mother = mother;
        this.birth = birth;
        this.center = center;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getMother() {
        return mother;
    }

    public void setMother(String mother) {
        this.mother = mother;
    }

    public String getBirth() {
        return birth;
    }

    public void setBirth(String birth) {
        this.birth = birth;
    }

    public String getCenter() {
        return center;
    }

    public void setCenter(String center) {
        this.center = center;
    }

    public Optional<String> getGroupId() {
        return Optional.ofNullable(groupId);
    }

    public void setGroupId(final String groupId) {
        this.groupId = groupId;
    }

    @Override
    public String toString() {
        return getGroupId().isPresent()
                ? "Client [groupId=" + groupId + ", id=" + id + ", name=" + name + ", mother=" + mother + ", birth=" + birth +
                ", center=" + center + "]"
                : "Client [id=" + id + ", name=" + name + ", mother=" + mother + ", birth=" + birth + ", center=" + center + "]";
    }
    
    ///
    /// Other public methods
    ///

    public Client generateAndAssignGroupId() {
        setGroupId(String.format("**group=%s**", center));
        return this;
    }
}

and the new stream:

Map<String, List<Client>> grouped = file.stream()
                .filter(x -> file.stream().anyMatch(y -> isDuplicate(x, y)))
                .sorted(Comparator.comparing(Client::getName).thenComparing(Client::getMother).thenComparing(Client::getBirth))
                .collect(Collectors.groupingBy(Client::getCenter,
                        Collectors.mapping(Client::generateAndAssignGroupId, Collectors.toList())));
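Note that generateAndAssignGroupId derives the value from center alone, so every record of a center receives the same groupId. One possible way to get the numbering requested in the question (restart at 1 for each center, one number per distinct name) is sketched below. The Rec class, its int group field and the assignGroups helper are hypothetical names introduced for this illustration, and the sample rows mirror the question's printed result (only the already filtered duplicates):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupNumberSketch {

    // Hypothetical stand-in for Client with a mutable group number.
    static final class Rec {
        final String id;
        final String name;
        final String center;
        int group; // filled in by assignGroups

        Rec(String id, String name, String center) {
            this.id = id;
            this.name = name;
            this.center = center;
        }

        @Override
        public String toString() {
            return "[group=" + group + ", id=" + id + ", name=" + name + ", center=" + center + "]";
        }
    }

    // The duplicate records as they appear in the question's grouped output.
    static List<Rec> sample() {
        return Arrays.asList(
                new Rec("1", "Antonio Carlos da Silva", "1"),
                new Rec("3", "Pedro de Albuquerque", "2"),
                new Rec("5", "Ralfo dos Santos Filho", "2"),
                new Rec("6", "Pedro de Albuquerque", "2"),
                new Rec("7", "Antonio Carlos da Silva", "1"),
                new Rec("8", "Ralfo dos Santos Filho", "2"),
                new Rec("11", "Pedro de Albuquerque", "2"),
                new Rec("12", "Ralfo dos Santos Filho", "2"));
    }

    // Numbers the distinct names of each center 1, 2, 3, ... in encounter order,
    // then sorts each center's list by that number so equal records sit together.
    static Map<String, List<Rec>> assignGroups(List<Rec> duplicates) {
        Map<String, List<Rec>> byCenter = new LinkedHashMap<>();
        Map<String, Map<String, Integer>> numbersByCenter = new HashMap<>();
        for (Rec r : duplicates) {
            Map<String, Integer> numbers =
                    numbersByCenter.computeIfAbsent(r.center, c -> new HashMap<>());
            Integer group = numbers.get(r.name);
            if (group == null) {
                group = numbers.size() + 1; // first time this name is seen in this center
                numbers.put(r.name, group);
            }
            r.group = group;
            byCenter.computeIfAbsent(r.center, c -> new ArrayList<>()).add(r);
        }
        for (List<Rec> list : byCenter.values()) {
            list.sort(Comparator.comparingInt((Rec r) -> r.group)); // stable sort
        }
        return byCenter;
    }

    public static void main(String[] args) {
        assignGroups(sample()).forEach((center, list) -> System.out.println(center + "=" + list));
    }
}
```

This sketch keys the inner groups on name only; a fuller version would probably combine name, mother and birth, as isDuplicate does.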

Hello... Thank you so much for your reply. The first part of the problem is solved! Is it impossible to assign a unique value to each group of repeated records, as shown in the last table? I believe that a Map must be used, but I don't know how... — Adalberto José Brasaca, Aug 08 '21 at 21:57
Please check again, I updated the answer. This group id is generated whenever you create the new structure. The generation logic belongs to the Client class hence the `Client::generateAndAssignGroupId`. — panagiotis, Aug 09 '21 at 07:11
Hi Panagiotis... Thank you so much for your efforts to help me. Almost there !!! I don't think it was clear what I meant. So I edited the last part of topic (the table) with comments. I hope it's clear now. Thank you again. — Adalberto José Brasaca, Aug 09 '21 at 11:30

Gautham M · Accepted Answer · 2021-08-11T12:06:55.853

Instead of using file.stream within each filter, you could create a map by forming a key using the relevant fields:

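Add a new method to the Client class:

// id is deliberately excluded (see the comments below): it is unique per row,
// so including it would give every record a different key.
public String getKey() {
    return String.format("%s~%s~%s", name, mother, birth);
}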

Use this to create a map with the count as value.

Map<String, Long> countMap = 
    file.stream()
        .map(Client::getKey)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
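A small self-contained sketch of this counting step, assuming a simplified Rec class and made-up rows taken from the question's CSV. The key is built from name, mother and birth only, since a key that also contains the row id can never repeat:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateCountSketch {

    // Hypothetical stand-in for Client, reduced to the fields that form the key.
    static final class Rec {
        private final String id;
        private final String name;
        private final String mother;
        private final String birth;

        Rec(String id, String name, String mother, String birth) {
            this.id = id;
            this.name = name;
            this.mother = mother;
            this.birth = birth;
        }

        String id() { return id; }

        // The id is not part of the key; only the fields that define a duplicate are.
        String key() { return String.join("~", name, mother, birth); }
    }

    static List<Rec> sample() {
        return Arrays.asList(
                new Rec("1", "Antonio Carlos da Silva", "Ana da Silva", "2008/03/31"),
                new Rec("2", "Carlos Roberto de Souza", "Amália Maria de Souza", "2004/12/10"),
                new Rec("3", "Pedro de Albuquerque", "Maria de Albuquerque", "2006/04/03"),
                new Rec("4", "Danilo da Silva Cardoso", "Sônia de Paula Cardoso", "2002/08/10"),
                new Rec("6", "Pedro de Albuquerque", "Maria de Albuquerque", "2006/04/03"),
                new Rec("7", "Antonio Carlos da Silva", "Ana da Silva", "2008/03/31"));
    }

    // One pass builds the key counts, a second pass keeps rows whose key occurs more than once.
    static List<String> duplicateIds(List<Rec> rows) {
        Map<String, Long> countMap = rows.stream()
                .collect(Collectors.groupingBy(Rec::key, Collectors.counting()));
        return rows.stream()
                .filter(r -> countMap.get(r.key()) > 1L)
                .map(Rec::id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(duplicateIds(sample())); // prints [1, 3, 6, 7]
    }
}
```

Compared with calling file.stream() inside the filter, this does two linear passes instead of comparing every record with every other record.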

Then

// For each inner group you need a separate id based on the name.
// The input would be a map with client name as the key and the
// value would be the corresponding list of clients.
// The below function returns a new map with 
// integer as the key part (required unique id for each inner group).
Function<Map<String, List<Client>>, Map<Integer, List<Client>>> mapper
    = map -> {
        AtomicInteger i = new AtomicInteger(1);
        return map.entrySet().stream()
                  .collect(Collectors.toMap(e -> i.getAndIncrement(), Map.Entry::getValue));
    };

// assuming static import of "java.util.stream.Collectors"
Map<String, Map<Integer, List<Client>>> grouped = 
    file.stream()
        .filter(x -> countMap.get(x.getKey()) > 1L) // indicates duplicate
        .collect(groupingBy(Client::getCenter,    
                            collectingAndThen(groupingBy(Client::getName, toList()),
                                              mapper /* the above function*/ )));
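The pieces above can be put together into a runnable sketch. The Rec class, the sample rows (the already filtered duplicates, as printed in the question) and the ids helper are assumptions made for illustration. One deliberate difference from the snippet above: the entries are sorted by name before numbering, because the inner groupingBy returns a map without a guaranteed key order, so the numbers would otherwise not be predictable:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

public class NestedGroupingSketch {

    // Hypothetical stand-in for Client (only the fields used here).
    static final class Rec {
        private final String id;
        private final String name;
        private final String center;

        Rec(String id, String name, String center) {
            this.id = id;
            this.name = name;
            this.center = center;
        }

        String id() { return id; }
        String name() { return name; }
        String center() { return center; }
    }

    static List<Rec> sample() {
        return Arrays.asList(
                new Rec("1", "Antonio Carlos da Silva", "1"),
                new Rec("3", "Pedro de Albuquerque", "2"),
                new Rec("5", "Ralfo dos Santos Filho", "2"),
                new Rec("6", "Pedro de Albuquerque", "2"),
                new Rec("7", "Antonio Carlos da Silva", "1"),
                new Rec("8", "Ralfo dos Santos Filho", "2"),
                new Rec("11", "Pedro de Albuquerque", "2"),
                new Rec("12", "Ralfo dos Santos Filho", "2"));
    }

    // Groups by center, then by name, and replaces each name key with 1, 2, 3, ...
    static Map<String, Map<Integer, List<Rec>>> group(List<Rec> duplicates) {
        // Runs once per center, so a fresh counter starts at 1 for every center.
        Function<Map<String, List<Rec>>, Map<Integer, List<Rec>>> mapper = byName -> {
            AtomicInteger next = new AtomicInteger(1);
            return byName.entrySet().stream()
                    .sorted(Map.Entry.comparingByKey()) // number the names alphabetically
                    .collect(Collectors.toMap(e -> next.getAndIncrement(), Map.Entry::getValue));
        };
        return duplicates.stream()
                .collect(Collectors.groupingBy(
                        Rec::center,
                        Collectors.collectingAndThen(
                                Collectors.groupingBy(Rec::name, Collectors.toList()),
                                mapper)));
    }

    // Convenience helper for printing and checking results.
    static List<String> ids(List<Rec> recs) {
        return recs.stream().map(Rec::id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Rec>>> grouped = group(sample());
        System.out.println(ids(grouped.get("2").get(1))); // prints [3, 6, 11]
        System.out.println(ids(grouped.get("2").get(2))); // prints [5, 8, 12]
    }
}
```

The result maps are looked up by key here rather than printed whole, since neither groupingBy nor toMap promises an iteration order for the maps they return.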

edited Aug 11 '21 at 12:06

answered Aug 09 '21 at 06:31


Hi... Thank you so much for your reply. Does the solution you presented restart the numbering of the internal groups when the center number changes? I edited the last part of the question (the table) with comments to make it clearer. – Adalberto José Brasaca Aug 09 '21 at 11:50
@AdalbertoJoséBrasaca Yes, the `mapper` function is invoked once for each center group, on that center's map of `name` groups. So the numbering restarts from 1. – Gautham M Aug 09 '21 at 11:56
+1 for the great mapper function; @AdalbertoJoséBrasaca this solution is more efficient when you have to process large amounts of data, as it runs in O(n) time. – panagiotis Aug 09 '21 at 18:38
@GauthamM Thank you for helping me Gautham... I learn something new with your great code. []s. – Adalberto José Brasaca Aug 10 '21 at 13:41
@GauthamM Hello Gautham. The last part of code presented 3 errors. I edited the topic and added it to the end. Could you please take a look ? Thank you. – Adalberto José Brasaca Aug 10 '21 at 16:51
@AdalbertoJoséBrasaca One `)` was accidentally added in the answer (after `Client::getCenter`). The `collectingAndThen` is argument to the outer `groupingBy`. I have [updated](https://stackoverflow.com/posts/68707717/revisions) the answer. – Gautham M Aug 11 '21 at 04:39
@GauthamM Hi Gautham... Sorry for the inconvenience, but did you get to test the code? Another error appeared, and to fix it I needed to change the Map signature from `Map<String, Map<String, List<Client>>>` to `Map<String, Map<Integer, List<Client>>>`. However, when I ran the program, `grouped.size()` returned 0. Any idea? – Adalberto José Brasaca Aug 11 '21 at 11:39
@AdalbertoJoséBrasaca Yes it was tested. But as you said, I used `Integer` instead of `String` for the inner map. I tried with one of my pojo classes similar to `Client`. – Gautham M Aug 11 '21 at 11:49
@AdalbertoJoséBrasaca The issue is with the logic in `getKey`: when `id` is also included, every entry in the CSV gets a different key. Hence none of the elements satisfies the `filter(x -> countMap.get(x.getKey()) > 1L)` condition. You may remove `id` from `getKey` and try. – Gautham M Aug 11 '21 at 12:16
@GauthamM **YES !!! It worked the way I need it.** Thank you again Gautham and sorry for the insistence and extra work. – Adalberto José Brasaca Aug 11 '21 at 13:19

score 0 · Answer 3 · answered Mar 29 '22 at 03:07

The task is to group the CSV records by center and rank the names within each group in ascending order. Doing this in plain Java takes a fair amount of code.

It is simple to do with SPL, an open-source Java package. One line of code is enough:

	A
1	=file("client.csv":"UTF-8").import@ct().sort(center,name).derive(ranki(name;center):group)

SPL provides a JDBC driver that can be invoked from Java. Store the above SPL script as dense_rank.splx and call it from Java as you would call a stored procedure:

…
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
st=con.prepareCall("call dense_rank ()");
st.execute();
…

Or execute the SPL string within a Java program as you would execute a SQL statement:

…
// A Java string literal cannot span lines, so the script is concatenated.
st = con.prepareStatement("==file(\"client.csv\":\"UTF-8\")"
     + ".import@ct().sort(center,name).derive(ranki(name;center):group)");
st.execute();
…
