
I'm processing 2 CSV files, checking for common entries, and saving them into a new CSV file. However, the comparison is taking a lot of time. My approach is to first read all the data from both files into ArrayLists; then, using a parallelStream over the main list, I compare each entry against the other list and append the common entries to a StringBuilder, which is then saved to the new CSV file. Below is my code for this.

allReconFileLines.parallelStream().forEach(baseLine -> {
    // Note: in ",|,," the second alternative never matches, since "," is tried first.
    String[] baseLineSplitted = baseLine.split(",|,,");
    // Accessing index 13 (the 14th column) requires length > 13, not >= 13.
    if (baseLineSplitted.length > 13 && baseLineSplitted[13].trim().equalsIgnoreCase("#N/A")) {
        // Inner linear scan: every recon line re-splits and walks the whole complete list.
        for (int i = 0; i < allCompleteFileLines.size(); i++) {
            String completeFileLine = allCompleteFileLines.get(i);
            String[] completeLineSplitted = completeFileLine.split(",|,,");
            if (completeLineSplitted.length > 3
                    && completeLineSplitted[3].replaceAll("^\"|\"$", "").trim()
                            .equals(baseLineSplitted[3].replaceAll("^\"|\"$", "").trim())) {
                // Caution: StringBuilder is not thread-safe; appending to it from
                // a parallel stream can corrupt or lose data.
                matchedLines.append(completeFileLine);
                break;
            }
        }
    }
});
pw.write(matchedLines.toString());

Currently it is taking hours to process. How can I make it faster?

  • You seem to loop over all entries in `allCompleteFileLines` for each entry in `allReconFileLines`, so the running time trends towards the product of the sizes of the two collections; building an index of some kind will improve this. How big are both of those lists? – Generous Badger Jul 15 '21 at 09:32
  • My first attempt would be to read them both into a HashMap keyed by `line[3]`, and then get their intersection via e.g. `map1.keySet().retainAll(map2.keySet())` (see the sketch after these comments). – Christoffer Hammarström Jul 15 '21 at 09:36
  • Records are typically around 300k in each file. – Danial Kayani Jul 15 '21 at 10:16
  • Actually better is to just read the keys of one file into e.g. a `HashSet`, and then as you're reading the second file, for each line check if it's in the set and if so write it out. This way you only need enough memory to keep the keys of one file. – Christoffer Hammarström Jul 15 '21 at 11:25
  • Doing it with a HashSet worked like magic. Thanks everyone, especially @ChristofferHammarström :) – Danial Kayani Jul 15 '21 at 18:36
  • @DanialKayani You're welcome! I copied my comment into an answer if you want to mark it solved. :) – Christoffer Hammarström Jul 15 '21 at 18:42
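
A minimal sketch of the map-based idea from the comment above, assuming the join key is the unquoted fourth column and using hypothetical file names (recon.csv, complete.csv, matched.csv), none of which are from the original post:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    public class MapIntersect {

        // Join key: fourth CSV column with surrounding quotes stripped.
        static String key(String line) {
            String[] cols = line.split(",");
            return cols.length > 3 ? cols[3].replaceAll("^\"|\"$", "").trim() : null;
        }

        // Index a file's lines by their join key.
        static Map<String, String> index(Path file) throws IOException {
            Map<String, String> byKey = new HashMap<>();
            for (String line : Files.readAllLines(file)) {
                String k = key(line);
                if (k != null) byKey.put(k, line);
            }
            return byKey;
        }

        public static void main(String[] args) throws IOException {
            Map<String, String> recon = index(Path.of("recon.csv"));
            Map<String, String> complete = index(Path.of("complete.csv"));
            // retainAll removes non-shared keys in place, and the change writes
            // through to the backing map, leaving only the common entries.
            complete.keySet().retainAll(recon.keySet());
            Files.write(Path.of("matched.csv"), complete.values());
        }
    }

Note that with duplicate keys a HashMap keeps the last line per key, whereas the question's loop keeps the first match.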

1 Answer


Read the keys of one file into e.g. a HashSet, and then as you're reading the second file, for each line check if it's in the set and if so write it out. This way you only need enough memory to keep the keys of one file.
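
A minimal sketch of this two-pass approach, keeping the "#N/A" filter from the question and assuming the join key is the unquoted fourth column; the file names (recon.csv, complete.csv, matched.csv) are hypothetical:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;

    public class CsvMatcher {

        // Join key: fourth CSV column with surrounding quotes stripped.
        static String key(String[] cols) {
            return cols.length > 3 ? cols[3].replaceAll("^\"|\"$", "").trim() : null;
        }

        public static void main(String[] args) throws IOException {
            // Pass 1: keep only the keys of the recon file in memory,
            // filtered to the "#N/A" rows as in the question.
            Set<String> reconKeys = new HashSet<>();
            try (BufferedReader reader = Files.newBufferedReader(Path.of("recon.csv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");
                    if (cols.length > 13 && cols[13].trim().equalsIgnoreCase("#N/A")) {
                        String k = key(cols);
                        if (k != null) reconKeys.add(k);
                    }
                }
            }

            // Pass 2: stream the complete file once; each set lookup is O(1) on average.
            try (BufferedReader reader = Files.newBufferedReader(Path.of("complete.csv"));
                 BufferedWriter writer = Files.newBufferedWriter(Path.of("matched.csv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String k = key(line.split(","));
                    if (k != null && reconKeys.contains(k)) {
                        writer.write(line);
                        writer.newLine();
                    }
                }
            }
        }
    }

This replaces the O(n·m) nested scan with two linear passes and constant-time lookups; for two 300k-line files that is roughly 600k line reads instead of up to 90 billion comparisons.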

Christoffer Hammarström