I have a CSV file with 500,000 records in it. The fields in the CSV file are as follows:
No, Name, Address
Now I want to compare the name and address of each record with the name and address of all remaining records.
I was doing it in the following way:
List<String> lines = new ArrayList<>();
String line;
// Read the whole file into memory, one raw CSV line per list entry.
BufferedReader firstbufferedReader = new BufferedReader(new FileReader(new File(pathname)));
while ((line = firstbufferedReader.readLine()) != null) {
    lines.add(line);
}
firstbufferedReader.close();
for (int i = 0; i < lines.size(); i++)
{
    // Note: this reader is created and then discarded without being used.
    csvReader = new CSVReader(new StringReader(lines.get(i)));
    csvReader = null;
    // Compare record i with every record after it.
    for (int j = i + 1; j < lines.size(); j++)
    {
        csvReader = new CSVReader(new StringReader(lines.get(j)));
        csvReader = null;
        application.linesToCompare(lines.get(i), lines.get(j));
    }
}
The linesToCompare function extracts the name and address from its two parameters and compares them. If two records match by 80% or more (based on name and address), I mark them as duplicates.
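To make the comparison step concrete, here is a minimal sketch of what such a function could look like. It is not my actual code: the class name LineComparer, the static boolean signature, the naive comma split (no quoted fields), and the use of a normalized Levenshtein distance averaged over name and address are all assumptions for illustration.

```java
import java.util.Locale;

class LineComparer {

    /** Assumed threshold: 80% similarity or more counts as a duplicate. */
    static final double THRESHOLD = 0.80;

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1] after trimming and lower-casing; 1.0 means identical. */
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int maxLen = Math.max(x.length(), y.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / maxLen;
    }

    /**
     * Assumes lines of the form "No,Name,Address" without quoted commas.
     * The split limit of 3 keeps any commas inside the address field.
     */
    static boolean linesToCompare(String first, String second) {
        String[] f = first.split(",", 3);
        String[] s = second.split(",", 3);
        if (f.length < 3 || s.length < 3) {
            return false; // malformed line, treat as non-matching
        }
        double score = (similarity(f[1], s[1]) + similarity(f[2], s[2])) / 2.0;
        return score >= THRESHOLD;
    }
}
```

In this sketch every call does work proportional to the product of the two string lengths, and that cost is paid once for every pair of records.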
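But this approach takes far too long to process the CSV file: comparing every record with every other one means about 125 billion comparisons (n(n-1)/2 for n = 500,000).
I want a faster approach, maybe some kind of MapReduce or anything else.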
Thanks in advance