Why is Apache commons csv parser appending unique data into 2nd result set?

Question

I have 2 CSV files (district1.csv, district2.csv) in a directory, each containing a column schoolCode. When I read both CSV files with the Apache commons CSV library, I am reading the distinct values of the schoolCode column and counting up the results. Here is my code:

public void getDistinctRecordCount() throws IOException {
    Set<String> uniqueSchools = new HashSet<>();
    int numOfSchools;
    String SchoolCode;

    //Filter to only read csv files.
    File[] files = Directory.listFiles(new FileExtensionFilter());

    for (File f : files) {
        CSVParser csvParser;
        CSVFormat csvFormat = CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim();
        reader = Files.newBufferedReader(Paths.get(Directory + "\\" + f.getName() ), StandardCharsets.ISO_8859_1);
        csvParser = CSVParser.parse(reader, csvFormat);
        for (CSVRecord column : csvParser) {
            SchoolCode = column.get("School Code");
            uniqueSchools.add(SchoolCode);
        }
        Logger.info("The list of Schools for " + f.getName() + " are: " + uniqueSchools);
        numOfSchools = uniqueSchools.size();
        Logger.info("The total count of Schools for " + f.getName() + " are: " + numOfSchools);
        Logger.info("-----------------------");
    }
}

Here is my output:

[INFO ] [Logger] - The list of Schools for district1.csv are: [01-0003-002, 01-0003-001]
[INFO ] [Logger] - The total count of Schools for district1.csv are: 2
[INFO ] [Logger] - The list of Schools for district2.csv are: [01-0003-002, 01-0003-001, 01-0018-004, 01-0018-005, 01-0018-002, 01-0018-003, 01-0018-008, 01-0018-006]
[INFO ] [Logger] - The total count of Schools for district2.csv are: 8

Problem: The two values read in from the district1.csv result are appended to the district2.csv result, throwing off my count by 2 for district2.csv (actual correct value should be 6). How is it being appended?

Because you are not resetting the HashSet, it keeps the values from the previous file. You need to reset it inside the for loop by adding the following line there: `uniqueSchools = new HashSet<>();` — dkb, Oct 14 '18 at 06:14

Ori Marko · Accepted Answer · 2018-10-14T06:28:43.067

If you don't need a set of all schools across files, you can just move the declaration of uniqueSchools inside the loop, or clear it at the start of each iteration:

for (File f : files) {
   uniqueSchools.clear();
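
To make the effect of the set's scope concrete, here is a minimal sketch that uses only the JDK. The parsed school codes are stubbed as in-memory lists (an assumption that stands in for the CSVParser loop), and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; the lists stand in for values read by CSVParser.
class PerFileDistinctCount {

    // Returns the distinct school-code count for each file, keyed by file name.
    static Map<String, Integer> countPerFile(Map<String, List<String>> codesByFile) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : codesByFile.entrySet()) {
            // Declared inside the loop, so every file starts with an empty set.
            Set<String> uniqueSchools = new HashSet<>();
            uniqueSchools.addAll(entry.getValue());
            counts.put(entry.getKey(), uniqueSchools.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> codesByFile = new LinkedHashMap<>();
        codesByFile.put("district1.csv", List.of("01-0003-001", "01-0003-002", "01-0003-001"));
        codesByFile.put("district2.csv", List.of("01-0018-002", "01-0018-003", "01-0018-002"));
        // Prints {district1.csv=2, district2.csv=2}
        System.out.println(countPerFile(codesByFile));
    }
}
```

Calling uniqueSchools.clear() at the top of the loop on a set declared outside it should give the same counts; declaring the set inside the loop simply makes it impossible to forget the reset.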

You can also store the schools per file in a Map<String, String>, or create a set per file, log the count, and then add that set to uniqueSchools with addAll:

Set<String> currentSchools = new HashSet<>();
..
currentSchools.add(SchoolCode);
Logger.info("The list of Schools for " + f.getName() + " are: " + currentSchools);
numOfSchools = currentSchools.size();
Logger.info("The total count of Schools for " + f.getName() + " are: " + numOfSchools);        
uniqueSchools.addAll(currentSchools);
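
A self-contained sketch of the per-file-set idea, again with the CSV reading stubbed out as in-memory lists. The class name, the addFile method, and the choice of a Map<String, Set<String>> to hold the per-file sets are assumptions for illustration; the sample codes are the ones from the question's log output.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper that tracks distinct codes per file and across all files.
class SchoolsByFile {
    // Distinct codes for each file, in the order the files were added.
    final Map<String, Set<String>> perFile = new LinkedHashMap<>();
    // Distinct codes across every file seen so far.
    final Set<String> allSchools = new TreeSet<>();

    void addFile(String fileName, List<String> codes) {
        // A fresh set per file keeps each file's count independent.
        Set<String> currentSchools = new TreeSet<>(codes);
        perFile.put(fileName, currentSchools);
        // The overall set is only updated after the per-file set is complete.
        allSchools.addAll(currentSchools);
    }

    public static void main(String[] args) {
        SchoolsByFile schools = new SchoolsByFile();
        schools.addFile("district1.csv", List.of("01-0003-002", "01-0003-001"));
        schools.addFile("district2.csv", List.of("01-0018-004", "01-0018-005",
                "01-0018-002", "01-0018-003", "01-0018-008", "01-0018-006"));
        for (Map.Entry<String, Set<String>> e : schools.perFile.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size());
        }
        System.out.println("all files: " + schools.allSchools.size());
    }
}
```

With the sample data this reports 2 schools for district1.csv, 6 for district2.csv, and 8 overall, which matches the counts the question expects.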

Also consider following Java naming conventions and starting variable names with a lowercase letter (camel case), e.g. change SchoolCode to schoolCode and Logger to logger.

I had a feeling it was something simple. Your suggestion of clearing uniqueSchools inside the loop did the trick! Thanks so much! — ktaylor, Oct 14 '18 at 06:39
