
I have a CSV file that contains several duplicate columns. I am trying to remove these duplicates using Java. I found the Apache Commons CSV library; some people use it to remove duplicate rows. How can I use it to remove or skip duplicate columns?

For example: my csv header is:

ID,Name,Email,Email

So far my code is:

Reader reader = Files.newBufferedReader(Paths.get("user.csv"));

// read the CSV file
Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader()
        .withIgnoreHeaderCase()
        .withTrim()
        .parse(reader);

for (CSVRecord record : records) {
    System.out.println("Record #: " + record.getRecordNumber());
    System.out.println("ID: " + record.get("ID"));
    System.out.println("Name: " + record.get("Name"));
    System.out.println("Email: " + record.get("Email"));
}

// close the reader
reader.close();


  • What is your code so far? – aBnormaLz Apr 27 '21 at 14:26
  • @aBnormaLz see my edit. How can I write them into a new csv file, or just update it in the original file? – duolanierduone Apr 27 '21 at 15:15
  • It appears that you know in advance which columns you want to keep, and which can be discarded. Given the column headings can have duplicate names, you can access your columns by index: [Accessing column values by index](https://commons.apache.org/proper/commons-csv/user-guide.html#Accessing_column_values_by_index). Access the ones you need - and then write out that data to a new CSV file. – andrewJames Apr 27 '21 at 18:07
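The index-based access mentioned in the last comment can be sketched like this (a minimal sketch for illustration; the `keepColumns` helper and the hard-coded indexes 0, 1, 2 are assumptions, chosen because the duplicate Email sits at index 3 in the example header):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class IndexAccess {

    // Keeps only the columns at the given indexes. The indexes are an
    // assumption here - you would choose them after inspecting the header.
    static List<List<String>> keepColumns(String csv, int... indexes) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        try (Reader reader = new StringReader(csv)) {
            for (CSVRecord record : CSVFormat.DEFAULT.parse(reader)) {
                List<String> row = new ArrayList<>();
                for (int i : indexes) {
                    row.add(record.get(i)); // access by index, not by heading
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "ID,Name,Email,Email\n1,Albert,foo@bar.com,foo@bar.com\n";
        // keep columns 0, 1 and 2 - dropping the duplicate Email at index 3
        for (List<String> row : keepColumns(csv, 0, 1, 2)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Because no header mapping is set up, the parser never sees the duplicate heading; each row (including the header row) is just accessed positionally.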

1 Answer


Your code is close to what you need - you just need to use CSVPrinter to write out your data to a new file.

import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class App {

    public static void main(String[] args) throws IOException {

        try (final Reader reader = Files.newBufferedReader(Paths.get("source.csv"),
                StandardCharsets.UTF_8)) {

            final Writer writer = Files.newBufferedWriter(Paths.get("target.csv"),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING); // replaces any existing output file

            try (final CSVPrinter printer = CSVFormat.DEFAULT
                    .withHeader("ID", "Name", "Email")
                    .print(writer)) {
                
                // read each input file record:
                Iterable<CSVRecord> records = CSVFormat.DEFAULT
                        .withFirstRecordAsHeader()
                        .withIgnoreHeaderCase()
                        .withTrim()
                        .parse(reader);
                
                // write each output file record
                for (CSVRecord record : records) {
                    printer.print(record.get("ID"));
                    printer.print(record.get("Name"));
                    printer.print(record.get("Email"));
                    printer.println();
                }
            }
        }
    }
}

This transforms the following source file:

ID,Name,Email,Email
1,Albert,foo@bar.com,foo@bar.com
2,Brian,baz@bat.com,baz@bat.com

To this target file:

ID,Name,Email
1,Albert,foo@bar.com
2,Brian,baz@bat.com

Some points to note:

  1. I was wrong in my comment. You do not need to use column indexes - you can use headings (as I do above) in your specific case.

  2. Whenever reading or writing a file, it is recommended to provide the character encoding. In my case, I use UTF-8. (This assumes the original file was created as a UTF-8 file, of course - or is compatible with that encoding.)

  3. When opening the reader and creating the printer I use "try-with-resources" statements. These mean I do not have to explicitly close the file resources - Java takes care of that for me (closing the `CSVPrinter` also closes the underlying writer).
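One caveat worth noting: depending on the Commons CSV version, parsing a file whose header itself repeats a name can fail with an `IllegalArgumentException` before you ever read a record. A sketch of a workaround, assuming Commons CSV 1.7 or later (where `withAllowDuplicateHeaderNames()` exists); the `firstEmail` helper is hypothetical, added only so the result is easy to check:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class AllowDuplicates {

    // Parses a CSV whose header repeats a name, without the parser rejecting it.
    static String firstEmail(String csv) throws IOException {
        try (Reader reader = new StringReader(csv)) {
            Iterable<CSVRecord> records = CSVFormat.DEFAULT
                    .withFirstRecordAsHeader()
                    .withAllowDuplicateHeaderNames() // available since Commons CSV 1.7
                    .withIgnoreHeaderCase()
                    .withTrim()
                    .parse(reader);
            for (CSVRecord record : records) {
                // with a duplicated heading, get("Email") resolves to one of the
                // Email columns - both hold the same value in this example
                return record.get("Email");
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstEmail("ID,Name,Email,Email\n1,Albert,foo@bar.com,foo@bar.com\n"));
    }
}
```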

andrewJames
  • @andrewjames Thanks! If the user.csv file contains a duplicate name, this error appears: `Exception in thread "main" java.lang.IllegalArgumentException: The header contains a duplicate name: "Email" in [ID, Name, Email, Email]` I am not sure if you have met this error – duolanierduone Apr 28 '21 at 18:02
  • Well, yes - that would give an error. But why are you using duplicate column names like that? As you can see from my answer, this is not needed. The point of your question is to _remove duplicate columns_. – andrewJames Apr 28 '21 at 18:08
  • Yes, I didn't use the code to print the same columns. When the code starts running, it detects that the input file contains a duplicate header, so this error occurs. – duolanierduone Apr 28 '21 at 18:17
  • (1) Does this happen when you use my code? (2) What version of Commons CSV are you using? – andrewJames Apr 28 '21 at 18:20
  • Ah - that makes sense. Glad you resolved it. – andrewJames Apr 28 '21 at 19:04
  • Regarding "_many duplicate columns and thousands of rows_", the first step is: you need to decide on some rules which can be followed. For example, if the column headings are the same, is that enough to identify a duplicate? If the data in every cell of column A matches the equivalent cell of column B, is that a duplicate? What if 99% of the values are the same? Is that a duplicate? These are all _business rules_ which we cannot really help you with, here, on this site. We can help you implement those rules using code, if you get stuck... – andrewJames Apr 28 '21 at 19:14
  • I mean, just like the example in your answer: if we don't know in advance which column should be removed, we only know there is a duplicate column, how do we find this duplicate column `Email`, remove it, and then print a new csv file? – duolanierduone Apr 28 '21 at 20:06
  • What is your definition of "duplicate column"? What makes a column a duplicate of some other column? – andrewJames Apr 28 '21 at 20:12
  • same column name and same content. – duolanierduone Apr 28 '21 at 20:14
  • First you can just check the column headings. But if you find duplicate headings... Now you can no longer process one row at a time. You have to load the entire file into a collection of JavaBean objects, because you will not know if a column is a duplicate until you get to the last line of the file. But that is not really rigorous. What if you have heading `E-Mail` and heading `Email`? They are different. But all their contents are identical... This is a large and potentially complicated data cleansing issue. Far too large for a SO question. – andrewJames Apr 28 '21 at 20:24
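The load-everything-then-compare-columns approach described in the last comment could be sketched as follows. This is a minimal sketch, not a complete data-cleansing solution: it assumes "duplicate" means an identical heading AND identical values in every row (the definition agreed above), and that the whole file fits in memory; the `removeDuplicateColumns` helper is hypothetical:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class DeduplicateColumns {

    // Returns a CSV with duplicate columns removed. A column is a duplicate
    // when its heading and every one of its cells match an earlier column.
    static String removeDuplicateColumns(String csv) throws IOException {
        // load the entire file - we cannot decide a column is a duplicate
        // until we have seen its last value
        List<CSVRecord> records;
        try (Reader reader = new StringReader(csv)) {
            records = CSVFormat.DEFAULT.parse(reader).getRecords();
        }
        CSVRecord header = records.get(0);

        // map each column's full content (heading + cells) to its first index;
        // later identical columns are ignored by putIfAbsent
        Map<List<String>, Integer> seen = new LinkedHashMap<>();
        for (int col = 0; col < header.size(); col++) {
            List<String> column = new ArrayList<>();
            for (CSVRecord record : records) {
                column.add(record.get(col));
            }
            seen.putIfAbsent(column, col);
        }
        List<Integer> keep = new ArrayList<>(seen.values());

        // write out only the surviving columns
        StringWriter out = new StringWriter();
        try (CSVPrinter printer = CSVFormat.DEFAULT.print(out)) {
            for (CSVRecord record : records) {
                for (int col : keep) {
                    printer.print(record.get(col));
                }
                printer.println();
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String csv = "ID,Name,Email,Email\n"
                + "1,Albert,foo@bar.com,foo@bar.com\n"
                + "2,Brian,baz@bat.com,baz@bat.com\n";
        System.out.print(removeDuplicateColumns(csv));
    }
}
```

Note that, per the comment above, this keeps two columns named `E-Mail` and `Email` even if their contents match, because the headings differ; loosening that rule is one of the business decisions the comments describe.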