Java: functionally read a .csv using OpenCSV, sort date column headers in ascending order, and merge duplicates using reduce()

Question

I spent a lot of time figuring this out and wanted to share my answer, which I think is worthwhile because it handles a complex table of data. This is my first Java project in which I try to use functional programming wherever I can. I solved it by searching Stack Overflow and piecing things together. I would be glad to get feedback on a better title, tags, and body for this question, and on the code too.

I am using OpenCSV to read a table of values with dates as column headers, which looks something like this:

Country	1/01/22	1/02/22	1/03/22	...
Ireland	0	5	150	...
Japan	7	189	3323	...

The numbers are the COVID cases for that date in that country.

The .csv file has hundreds of columns, most of which have dates as headers. Furthermore, the Country column contains duplicate country names, one row for each province of the country.

To remove the duplicates in the Country column, I need to add up the cases by date across the provinces, so that I get the total cases for each country by date.

My attempt is in the answer section below. Here's the .csv file for anyone who wants to try: https://drive.google.com/file/d/18DwzH-sse3zJXtcjLRrVCG2vasoGlCLn/view?usp=sharing

You need to clarify the question to begin with. And the answer doesn't contain any explanation, which is also not very nice. — Alexander Ivanchenko, Oct 18 '22 at 07:46
Gotcha, I'll add explanation to my answer in a while. I'll also mention that my attempt is in answer section — TenHorizons, Oct 18 '22 at 07:50

TenHorizons · Answer 1 · 2022-10-18T11:31:07.123

My attempt uses two libraries:

OpenCSV
apache.commons.collections4 (MultiValuedMap & ArrayListValuedHashMap)

To learn OpenCSV quickly, I recommend reading the official documentation. It took me a day to read half of it, which was enough to know how to read from a file: https://opencsv.sourceforge.net/index.html#developer_documentation

collections4 is needed because OpenCSV's @CsvBindAndJoinByName annotation binds the matching columns into a MultiValuedMap.

The first step is to read the .csv file using OpenCSV. In my attempt I use annotations to read each row of the file directly into a bean:

@CsvBindByName(column="Country/Region",required=true)
private String country;
@CsvBindAndJoinByName(column="[0-9]{1,2}/[0-9]{1,2}/[0-9]{1,4}", elementType = String.class, mapType = ArrayListValuedHashMap.class)
private MultiValuedMap<String,String> casesByDate;

The first problem is ordering. OpenCSV does not hand the columns back in sorted order (as far as I know, there is no MultiValuedMap that sorts its keys automatically), so the date columns are not sorted, and neither are the country rows.

My solution is to add a new field that stores the data sorted by date:

private TreeMap<LocalDate, Integer> sortedCasesByDate = new TreeMap<>();

Below is the method used to populate sortedCasesByDate:

public CasesByCountry addToSortedCasesByDate(MultiValuedMap<String,String> map) {
    DateTimeFormatter dateFormat = DateTimeFormatter.ofPattern("M/d/yy");
    // keySet() visits each date header once. This assumes every date header
    // appears only once in the file, so each key holds a single value.
    for (String key : map.keySet()) {
        sortedCasesByDate.put(LocalDate.parse(key, dateFormat), Integer.valueOf(map.get(key).iterator().next()));
    }
    return this;
}
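To show why the TreeMap fixes the ordering, here is a minimal sketch that uses only the JDK. The class name DateHeaderSortSketch and the plain Map<String, String> input are stand-ins I made up for illustration (the real bean receives a MultiValuedMap from OpenCSV). Note that sorting the raw header strings would put "1/10/22" before "1/2/22", which is why the keys are parsed into LocalDate first.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of the answer's code.
class DateHeaderSortSketch {
    // Same pattern as in the answer: month/day/two-digit year.
    private static final DateTimeFormatter HEADER_FORMAT = DateTimeFormatter.ofPattern("M/d/yy");

    /** Copies header -> count pairs into a TreeMap keyed by the parsed date. */
    static TreeMap<LocalDate, Integer> sortByDate(Map<String, String> rawCases) {
        TreeMap<LocalDate, Integer> sorted = new TreeMap<>();
        rawCases.forEach((header, count) ->
                sorted.put(LocalDate.parse(header, HEADER_FORMAT), Integer.valueOf(count)));
        return sorted;
    }

    public static void main(String[] args) {
        // Map.of has no defined iteration order, much like the hash-based map OpenCSV fills.
        Map<String, String> raw = Map.of("1/10/22", "150", "1/2/22", "5", "12/31/21", "0");
        // The TreeMap iterates in chronological order regardless of insertion order.
        System.out.println(sortByDate(raw));
    }
}
```

Because TreeMap orders its keys by LocalDate's natural ordering, no explicit sort call is needed afterwards.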

Full code of the class file (the annotated bean that represents one country row):

imports ...

public class CasesByCountry {
    @CsvBindByName(column="Country/Region",required=true)
    private String country;
    @CsvBindAndJoinByName(column="[0-9]{1,2}/[0-9]{1,2}/[0-9]{1,4}", elementType = String.class, mapType = ArrayListValuedHashMap.class)
    private MultiValuedMap<String,String> casesByDate;
    private TreeMap<LocalDate, Integer> sortedCasesByDate = new TreeMap<>();

    public CasesByCountry(){}

    public String getCountry() {
        return country;
    }
    public MultiValuedMap<String, String> getCasesByDate() {
        return casesByDate;
    }
    public TreeMap<LocalDate, Integer> getSortedCasesByDate() {
        return sortedCasesByDate;
    }

    public CasesByCountry addToSortedCasesByDate(MultiValuedMap<String,String> map) {
        DateTimeFormatter dateFormat = DateTimeFormatter.ofPattern("M/d/yy");
        // Assumes every date header appears only once, so each key holds a single value.
        for (String key : map.keySet()) {
            sortedCasesByDate.put(LocalDate.parse(key, dateFormat), Integer.valueOf(map.get(key).iterator().next()));
        }
        return this;
    }

    //Merges the sortedCasesByDate of two CasesByCountry beans.
    //Used in reduce() by Reader to merge sortedCasesByDate of 2 provinces.
    public BinaryOperator<CasesByCountry> setSortedCasesByDate = (country1, country2) -> {
        //merge() adds to the existing total, or inserts the value if country1 has no entry for that date.
        country2.getSortedCasesByDate()
                .forEach((date, numOfCases) ->
                        country1.getSortedCasesByDate().merge(date, numOfCases, Integer::sum));
        return country1;
    };
}
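The merge step can also be looked at in isolation. The following is a minimal JDK-only sketch, not the code of the class above: the class name MergeCasesSketch and the method mergeCases are hypothetical. It uses Map.merge, which sums the counts when a date exists in both maps and simply inserts the count when it exists in only one, so a province with a missing date does not cause a NullPointerException.

```java
import java.time.LocalDate;
import java.util.TreeMap;

// Hypothetical demo class, not part of the answer's code.
class MergeCasesSketch {
    /** Returns a new map holding, for every date in either input, the sum of both counts. */
    static TreeMap<LocalDate, Integer> mergeCases(TreeMap<LocalDate, Integer> first,
                                                  TreeMap<LocalDate, Integer> second) {
        // Copy the first map so neither input is modified.
        TreeMap<LocalDate, Integer> merged = new TreeMap<>(first);
        // merge() inserts the value when the date is absent and sums when it is present.
        second.forEach((date, cases) -> merged.merge(date, cases, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<LocalDate, Integer> provinceA = new TreeMap<>();
        provinceA.put(LocalDate.of(2022, 1, 1), 1);
        provinceA.put(LocalDate.of(2022, 1, 2), 2);

        TreeMap<LocalDate, Integer> provinceB = new TreeMap<>();
        provinceB.put(LocalDate.of(2022, 1, 2), 3);
        provinceB.put(LocalDate.of(2022, 1, 3), 4);

        System.out.println(mergeCases(provinceA, provinceB));
    }
}
```

Returning a fresh map instead of mutating the first argument is a design choice of this sketch; it keeps the function free of side effects, which fits the functional style the question is aiming for.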

Once the annotated class is complete, read the file using the code shared in the OpenCSV documentation. The function also calls processInput to process the data afterwards:

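public static Function<String, List<CasesByCountry>> readFile = (path) -> {
    //try-with-resources closes the FileReader once parsing is done.
    try (FileReader reader = new FileReader(path)) {
        List<CasesByCountry> l = new CsvToBeanBuilder<CasesByCountry>(reader)
                .withType(CasesByCountry.class)
                .build()
                .parse();
        l = processInput.apply(l);
        l.forEach(System.out::println);
        return l;
    } catch (IOException e) {
        //Covers FileNotFoundException as well as failures while closing the reader.
        throw new RuntimeException(e);
    }
};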

In processInput, the dates are sorted first. Then the duplicate countries are removed using reduce. These are the Stack Overflow questions I referred to for this part:

Java 8 stream sum entries for duplicate keys

Apply reduction only if certain condition is met

The problem with reduce is that it cannot apply a condition to decide whether two elements should be combined. For example, it cannot do the following:

if (country1.getCountry().equals(country2.getCountry())) {
    //reduce()
} else {
    //go to next.
}

Therefore, Collectors.groupingBy is used to create a map of lists (Map<String, List<CasesByCountry>>). Each list holds the duplicate rows of one country. Then reduce is performed on each individual List<CasesByCountry>, and the results are joined together again:
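As an aside, the group-then-reduce idea can be sketched with the JDK alone. The example below is an alternative I am naming plainly, not the approach of this answer: it uses the four-argument Collectors.toMap, whose merge function is only called when two rows share a key, which is exactly the "condition" that a bare reduce cannot express. The Row class and all names in it are hypothetical stand-ins for the CasesByCountry bean.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the answer's code.
class GroupAndReduceSketch {
    /** Hypothetical stand-in for one CSV row: a country name plus its cases by date. */
    static final class Row {
        final String country;
        final TreeMap<LocalDate, Integer> cases;

        Row(String country, TreeMap<LocalDate, Integer> cases) {
            this.country = country;
            this.cases = cases;
        }
    }

    /** Sums two date -> cases maps into a new map without touching the inputs. */
    static TreeMap<LocalDate, Integer> sum(TreeMap<LocalDate, Integer> a, TreeMap<LocalDate, Integer> b) {
        TreeMap<LocalDate, Integer> merged = new TreeMap<>(a);
        b.forEach((date, cases) -> merged.merge(date, cases, Integer::sum));
        return merged;
    }

    /** Groups rows by country; rows with the same country are combined with sum(). */
    static Map<String, TreeMap<LocalDate, Integer>> totalsByCountry(List<Row> rows) {
        return rows.stream().collect(Collectors.toMap(
                row -> row.country,          // key: country name
                row -> row.cases,            // value: that row's cases
                GroupAndReduceSketch::sum,   // only called for duplicate countries
                TreeMap::new));              // TreeMap keeps countries sorted by name
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2022, 1, 1);
        List<Row> rows = List.of(
                new Row("Japan", new TreeMap<>(Map.of(day, 7))),
                new Row("Canada", new TreeMap<>(Map.of(day, 2))),
                new Row("Canada", new TreeMap<>(Map.of(day, 3))));
        System.out.println(totalsByCountry(rows));
    }
}
```

Collecting into a TreeMap also gives the countries in alphabetical order, so a separate sort step is not needed in this variant.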

/**
 * map: sort cases by ascending date.
 * groupingBy: split into lists of countries to identify duplicates.
 * reduce: reduce CasesByCountry by merging sortedCasesByDate TreeMaps.
 */
public static UnaryOperator<List<CasesByCountry>> processInput = casesByCountryList -> {
    List<CasesByCountry> toR = new ArrayList<>();
    casesByCountryList.stream().map(
            casesByCountry ->
                    casesByCountry.addToSortedCasesByDate(
                            casesByCountry.getCasesByDate()
                    )
    ).collect(
            Collectors.groupingBy(CasesByCountry::getCountry)
    ).forEach(
            (country, casesByCountry) ->
                    //Each group has at least one row, so the Optional is never empty here.
                    casesByCountry.stream()
                            .reduce((country1, country2) ->
                                    country1.setSortedCasesByDate.apply(country1, country2))
                            .ifPresent(toR::add)
    );
    //.sort to sort by countries.
    toR.sort(Comparator.comparing(CasesByCountry::getCountry));
    return toR;
};

Full code of Reader class:

imports...

public class Reader{
    private static List<CasesByCountry> confirmedCases;

    public Reader(){
        //CaseType.CONFIRMED.getPath() is just an enum to store the file path.
        confirmedCases = readFile.apply(CaseType.CONFIRMED.getPath());
    }

    /**
     * map: sort cases by ascending date.
     * groupingBy: split into lists of countries to identify duplicates.
     * reduce: reduce CasesByCountry by merging sortedCasesByDate TreeMaps.
     */
    public static UnaryOperator<List<CasesByCountry>> processInput = casesByCountryList -> {
        List<CasesByCountry> toR = new ArrayList<>();
        casesByCountryList.stream().map(
                casesByCountry ->
                        casesByCountry.addToSortedCasesByDate(
                                casesByCountry.getCasesByDate()
                        )
        ).collect(
                Collectors.groupingBy(CasesByCountry::getCountry)
        ).forEach(
                (country, casesByCountry) ->
                        //Each group has at least one row, so the Optional is never empty here.
                        casesByCountry.stream()
                                .reduce((country1, country2) ->
                                        country1.setSortedCasesByDate.apply(country1, country2))
                                .ifPresent(toR::add)
        );
        //.sort to sort by countries.
        toR.sort(Comparator.comparing(CasesByCountry::getCountry));
        return toR;
    };

    public static Function<String, List<CasesByCountry>> readFile = (path) -> {
        //try-with-resources closes the FileReader once parsing is done.
        try (FileReader reader = new FileReader(path)) {
            List<CasesByCountry> l = new CsvToBeanBuilder<CasesByCountry>(reader)
                    .withType(CasesByCountry.class)
                    .build()
                    .parse();
            l = processInput.apply(l);
            l.forEach(System.out::println);
            return l;
        } catch (IOException e) {
            //Covers FileNotFoundException as well as failures while closing the reader.
            throw new RuntimeException(e);
        }
    };

    public List<CasesByCountry> getConfirmedCases() {
        return confirmedCases;
    }
}
